**Advantages and Limitations of Naturalistic Study Designs and Their Implementation in Alcohol Hangover Research**

**Joris C. Verster 1,2,3,\*, Aurora J. A. E. van de Loo 1,2, Sally Adams 4, Ann-Kathrin Stock 5, Sarah Benson 3, Andrew Scholey 3, Chris Alford 6 and Gillian Bruce 7**


Received: 2 November 2019; Accepted: 5 December 2019; Published: 6 December 2019

**Abstract:** In alcohol hangover research, both naturalistic designs and randomized controlled trials (RCTs) are successfully employed to study the causes, consequences, and treatments of hangovers. Although increasingly applied in both social sciences and medical research, the suitability of naturalistic study designs remains a topic of debate. In both types of study design, screening participants and conducting assessments on-site (e.g., psychometric tests, questionnaires, and biomarker assessments) are usually equally rigorous and follow the same standard operating procedures. However, the designs differ in the levels of monitoring and the restrictions imposed on participant behaviors before the assessments are conducted (e.g., the drinking behaviors resulting in the next-day hangover). These behaviors are highly controlled in RCTs and uncontrolled in naturalistic studies. As a result, the largest differences between naturalistic studies and RCTs are their ecological validity, which is usually significantly lower in RCTs, and, related to that, the degree of standardization of the experimental intervention, which is usually significantly higher in RCTs. In this paper, we specifically discuss the application of naturalistic study designs and RCTs in hangover research. It is debatable whether it is necessary to control certain behaviors that precede the hangover state when the aim of a study is to examine the effects of the hangover state itself. If the preceding factors and behaviors are not the focus of the research question, a naturalistic study design should be preferred whenever one aims to better mimic or understand real-life situations in experimental/intervention studies. Furthermore, to improve the level of control in naturalistic studies, mobile technology can be applied to provide more continuous and objective real-time data, without investigators interfering with participant behaviors or the lab environment impacting the subjective state.
However, for other studies, it may be essential that certain behaviors are strictly controlled. For example, when comparing the efficacy and safety of a hangover treatment with placebo, it is vital that both test days are comparable in terms of consumed alcohol and achieved hangover severity levels. This is best accomplished with a highly controlled RCT design.

**Keywords:** study design; naturalistic study; randomized controlled trial; alcohol; hangover; blinding; mobile technology

#### **1. Introduction**

The alcohol hangover is defined as a combination of mental and physical symptoms, experienced the day after a single episode of heavy drinking, starting when the blood alcohol concentration approaches 0 [1]. Studies in this research area examine the causes, functional consequences, and potential treatments of the next-day (i.e., post-intoxication) effects of alcohol consumption. The alcohol hangover is associated with cognitive and psychomotor impairment [2] and mood changes [3], and may negatively affect daily activities, such as driving a car [4,5] or job performance [6]. The World Health Organization (WHO) estimates that 5.1% of the global burden of disease and injury is attributable to alcohol use and its consequences [7], and a recent UK study estimated the economic cost of hangovers, in terms of absenteeism and presenteeism, at 4 billion GBP per year [8]. Despite this, the pathology of the alcohol hangover is poorly understood [9,10], and although there is great market demand [11], there are currently no effective hangover treatments available [12].

Both randomized controlled trials (RCTs) and naturalistic study designs are commonly applied in hangover research. Although increasingly applied in social sciences and medical research, the suitability of using naturalistic study designs remains a topic of debate. To examine this, our paper compares the naturalistic study design with the traditional controlled experimental design, in particular RCTs. It discusses the advantages and disadvantages of both designs and suggests solutions for issues of concern.

Traditionally, medical science has been based on clinical observations of patients and control samples. In the fields of psychiatry and psychology, for example, participants either self-report their mood or an investigator observes their behavior. This was common practice before the introduction of RCTs. However, since their introduction, the quality, methodology, and reporting of medical science have been continuously optimized [13], and the RCT is, therefore, currently often viewed as the gold standard that allows for the most precise and systematic investigations. RCTs are, for example, commonly used to investigate the efficacy and safety of a medicinal drug in a specific patient population. The RCT design is characterized by several inclusion, exclusion, and discontinuation criteria that apply to participants, including lifestyle rules with regard to, for example, alcohol and drug use and smoking. RCTs are ideally double or triple blind to avoid influencing the study outcome, and participants are randomly allocated to treatment conditions. The treatment order is varied (cross-over) to account for any learning or order effects. All study-related activities are highly standardized and conducted per protocol, with the aim of making all test days as identical to each other as possible. In theory, the only methodological difference between the test days is the administered treatment or intervention. This way, it is thought that the study gathers 'clean' data about the effect of the treatment or intervention. However, this level of control comes at the cost of RCTs creating highly artificial situations, which lack ecological validity and/or potentially differ from the effects observed in the participants' everyday life.

On the other hand, the aim of the naturalistic study design is to mimic real life as closely as possible, and as such it is characterized by a minimum of lifestyle rules for participants, in which the investigators do not (actively) interfere with their activities. Hence, several behaviors and activities of the participants are not standardized and not regulated by a study protocol. Participants continue their normal lives and may visit the testing site for assessments or bio-sample collection, or may even be able to undertake these assessments whilst remaining in their usual environment. Commonly, the only instructions are to behave normally (e.g., take their medication as prescribed or drink alcohol as they would on a normal night out), complete scheduled assessments (e.g., a sleep diary or online scales), and visit the testing site at set times.

The naturalistic design is increasingly utilized in various research areas and has been successfully applied in phase III studies and pharmacovigilance research, e.g., to investigate the efficacy of antipsychotics in schizophrenia patients [14] or breast cancer patients [15]. The following sections will discuss the commonalities and differences between RCTs and naturalistic study designs, advantages and disadvantages, and possible solutions to common pitfalls.

#### **2. Recruitment, Screening, and Test Days**

Both RCTs and naturalistic studies have highly controlled data collection on test days. This includes conducting standardized and validated tests according to good clinical practice (GCP) and utilizing standard operating procedures at pre-set times specified in the study protocol. Furthermore, both study designs can have various lifestyle rules (e.g., no alcohol or drug use, no smoking), which can be verified by objective assessments on the test day. In this respect, naturalistic studies usually do not differ from RCTs.

Recruitment, screening, selecting, and training of participants can also be equally rigorous in RCTs and naturalistic studies. Both study designs can apply the same inclusion and exclusion criteria. Objective assessments can be conducted to verify the criteria (e.g., blood chemistry, urinalysis, and electrocardiography), and participants can be familiarized with and trained in completing psychometric tests, treatment administration, and completing mood scales. Rigorous screening and selection of study participants is common in RCTs mainly because it ensures a more homogeneous study sample. When eligibility criteria are loosened, more variability between study participants in responsiveness to the administered treatments is expected, which may decrease the chances of successfully demonstrating efficacy or safety. To demonstrate the true drug effect, assessments should not be obscured by various external uncontrolled factors. Unfortunately, applying a large number of eligibility criteria usually results in a considerable number of screening failures (i.e., participants not meeting all criteria for participation) or drop-outs and compliance failures (i.e., participants discontinuing or failing to adhere to the study protocol). This is commonly seen in RCTs [16–18]. In addition, a number of people may not participate in the first place when they are informed about the strict lifestyle rules and the hassle of screening procedures (e.g., blood draws and medical examinations). Unfortunately, this may induce a (self-)selection bias in the study sample.

The extent to which RCT participants in drug development are representative of the patient population can therefore be questioned [17,18]. While some 'safety-related' eligibility criteria are obviously necessary, other eligibility criteria (e.g., cut-off values for body weight ranges) are often not strongly justified by supporting scientific evidence [16]. Not applying or loosening unjustified eligibility criteria will increase recruitment speed and result in a study sample that better reflects the entire patient population. Some recent RCTs have, therefore, included a 'real life' arm in their study, including participants who did not meet the stringent eligibility criteria of the RCT [19]. As naturalistic studies aim to mimic real life, eligibility criteria are often less strict than those applied in RCTs. This may significantly increase the ecological validity of the study, which is usually low in RCTs [14].

#### **3. Level of Control, Supervision, and Monitoring**

All RCT study-related activities are closely monitored at the testing site (e.g., clinic or lab). However, this is not always the case in naturalistic study designs, in which researchers are not necessarily present.

One issue is the non-reporting of behavior. As participation in research studies is typically confidential, and sometimes anonymous, there should be no objective reason for participants not to report certain behaviors. However, if these behaviors are restricted by discontinuation criteria, participants may decide not to report them in order to prevent themselves from being excluded from further study participation. Another reason could be social desirability, as participants may be less likely to report behaviors or incidents that they either perceive to be detrimental to their self-image or that they fear may result in negative judgement from others. Another issue may be misreporting. Participants may not report certain behaviors simply because they were not asked about them (e.g., a researcher refrains from questioning participants about drug use, because an inclusion criterion for the study was not using drugs), or because they view these behaviors as irrelevant to the study (e.g., a participant being unaware that drinking a cup of coffee can improve subsequent cognitive test performance). Fortunately, there are several ways to retrospectively and objectively verify the occurrence of study-relevant behaviors, including assessments of residual alcohol use (breathalyzer), drug use (urine tests), and recent smoking (exhaled carbon monoxide), or monitoring activity and sleep episodes (actigraphy).

In both naturalistic studies and RCTs, it is also increasingly common to implement ambulatory assessments in the study design, for example cognitive tests or questionnaires completed online/at home. Not having to schedule visits to the testing site makes it easier to participate in the study and thus reduces the chance of dropout. It also allows for repeated testing at fixed time intervals, which may help to reduce the risk of study-relevant events not being recalled correctly. At-home testing has been successfully implemented in numerous phase III studies, using the same tests that would have been conducted in the clinic (e.g., online cognitive tests, blood pressure assessment, or self-administered blood glucose tests). In short, the use of mobile technologies enables compliance monitoring. Furthermore, mobile technology, home testing, and the internet provide various ways to ensure valid and reliable real-time assessments of cognitive and physical functioning, mood, and biomarkers [20–23].

However, in naturalistic studies, assessments are often limited to retrospective and subjective self-reports. When relying entirely on self-reports, recall bias and memory loss may have a significant impact on the accuracy of the collected data. For example, research has shown that people under- or over-estimate the amount of alcohol consumed [22,23] and that subjective and objective assessments of sleep parameters are not always concordant [23]. The latter should be taken into account when interpreting the data obtained in naturalistic studies.

To prevent the presence of observers/researchers from influencing the behaviors of study participants, one could consider monitoring participants' behaviors in real time via video streaming, without their awareness of being filmed. However, this approach would raise ethical, privacy, and data security concerns. A better alternative would be to apply mobile technology to measure behaviors objectively, including parallel objective measures that help triangulate data obtained from other measures.

Activity, sleep, and physiological parameters, such as heart rate and body temperature, can, for example, all be measured in real time using activity watches or 'wearables'. Behavioral and mood data can be collected by real-time self-reports via smartphone apps (e.g., participants entering every drink they consume). Alternatively, wearable technology (watches) that may record transdermal alcohol concentrations is currently being developed. In the future, these devices could be used to complement or partly replace self-reports. Moreover, they could help to reduce drop-out rates, as a number of "passive" measurements could be conducted without requiring any effort from the participants. Importantly, this would also help to obtain a more complete picture in studies that investigate aversive effects, such as a hangover, which might lead to systematic drop-outs on the more severe end of the symptom scale. Taken together, mobile technology would not only reduce the strain on study participants, but potentially also make the measurements more objective. In addition, test batteries used in RCTs are often administered as single assessments or, at best, infrequently. These can therefore easily miss critical events or periods. Mobile data collection can include participant-actioned recording of events and more regular testing, or continuous psychophysiological assessments, including wearable devices, which can all provide a better picture of participant behavior and subjective state.

As part of mobile testing, conducting an online survey is another common way to collect data from participants. This is effective if the subject sample is large or if it is not necessary or possible for participants to visit the research facility (e.g., due to obstacles such as bad weather, large distances, or physiological constraints). While online methodologies are an easy way to collect data, there are several disadvantages. For example, the researcher cannot be certain whether the scheduled participant is completing the survey or whether someone else is doing it in their place. Furthermore, the condition of the participant cannot be verified by the researcher (e.g., they might be drunk or drugged while completing the survey or may not be giving the assessment their full attention), which may reduce the accuracy and validity of the resulting data. The reliability of the collected data can be increased by further enhancing this methodology, for example with video streaming. Video streaming can confirm whether the scheduled participant is actually present and can verify how the participant conducts a test or completes questionnaires. It further enables the researcher to observe the participant's general health and makes it possible to record real-time observer-rated adverse effects.

#### **4. Level of Standardization of Tests and Procedures**

While recruitment, screening, and test day assessments can have comparable levels of control and standardization in RCTs and naturalistic studies, the designs differ significantly with regard to the standardization of participants' activities during the intervention phase. In RCTs, every activity of the participant takes place in the testing facility. Activities are scheduled at pre-set times and conducted according to standard procedures. This includes treatment administration, meals, activities, time going to bed, and the environment where participants spend time (i.e., the testing site). Moreover, all assessments and activities are standardized and precisely monitored and recorded by the researchers. The rationale for conducting an RCT in this way is clear: by minimizing the non-intervention-related variability (i.e., the uncontrolled "noise") in all potentially study-relevant parameters, the chance of observing a true treatment effect increases.

In contrast, in naturalistic studies, participants continue with their usual activities and researchers do not observe them or provide instructions on how to behave. Thus, the researchers do not interfere with the participants' activities. Consequently, behaviors are unstandardized and self-initiated. The rationale for this approach is to closely mimic real life, i.e., to maximize ecological validity. This ecological validity is important because it best reflects the way in which phenomena such as hangovers emerge and in which medicinal treatments will actually be used once marketed. Additionally, eligibility criteria in naturalistic studies may be less strict than those of RCTs to ensure that the study sample better reflects the heterogeneous population who will use a treatment or intervention in clinical practice, providing a better picture of efficacy. Thus, rather than a limitation, the lack of standardization can be considered a benefit of the naturalistic study design.

A related discussion concerns the use of subjective versus objective assessments and the quest for the inclusion of biomarker assessments in a study. Cytokine concentrations, for example, can vary in cases of depression [24] or during the hangover state [25]. It can thus be interesting to assess cytokine changes in blood or saliva. The alcohol hangover state is a subjective experience which, to date, cannot be objectively measured. Although this can be viewed as a significant limitation of this research area, it should be underlined that biomarkers are, by definition, at best proxy measures if one aims to measure mood or how the participant feels. Clinical observations may be an alternative, but these usually do not substitute for subjective assessments of the severity or nature of mood states. To date, the best way to rate mood levels is by asking participants to report how they feel [26]. Interestingly, the outcome of these subjective assessments does not always correspond with the outcome of objective biomarker assessments. Participants can, for example, report feeling perfectly fine while having a clinically relevant increase in blood pressure. Alternatively, participants can report sleep complaints and poor sleep quality while their polysomnographic outcomes are within normal ranges. Together, these findings advocate including both subjective and objective assessments in future studies, irrespective of whether the study design is an RCT or naturalistic.

#### **5. Implications for Hangover Research**

To provoke the hangover state, an evening of supervised alcohol consumption is typically scheduled in RCTs. The amount and type of alcoholic drink (and placebo) and the pace of drinking are usually pre-defined, and drinking is conducted within a pre-set time frame. This typically takes place in a clinical setting, often in the company of other participants who do not know each other. Food and other beverage intake (e.g., water) are prohibited or controlled, as are the cognitive and physical activities of the participants. All activities are closely monitored and recorded by the researchers, including blood alcohol concentration (BAC) assessments to verify alcohol consumption levels and adverse event recording. The evening activities are often concluded by a night of supervised sleep in the clinic, with a pre-set bedtime and wake-up time. Sleep quality and duration can be monitored with polysomnography or by study personnel.

In contrast, in naturalistic studies, participants drink in a familiar setting (e.g., a bar or at home) with people they know, engaging in their usual activities. These normally differ from activities employed in RCTs (e.g., dancing in a club versus reading a magazine in the laboratory). In naturalistic studies, participants can eat food when they feel hungry and smoke and are exposed to external stimuli which are not replicated in the RCT setting (e.g., visiting multiple bars, walking outside in the rain, waiting for a bus to travel home). They can go to bed when feeling sleepy without being restricted by study procedures, which often dictate a much earlier time-to-bed than people have in real life after an evening out. As they sleep in their own beds, they will not experience the sleep problems that are common in RCTs, in which participants sleep in a new and unknown clinical environment (e.g., the first night effect) [27,28]. In addition, participants can apply their personal sleep habits, sleep hygiene activities, and wake-up rituals in naturalistic studies. Finally, socializing, expectancies, and motives for alcohol consumption most likely differ between real-life situations and RCTs and may impact assessment outcomes. Thus, in naturalistic studies, participants can either drink alone or have an evening with friends in a setting of their own choice. Bedtime is self-initiated, and participants sleep at home in their own bed. The next morning, participants come to the testing site for the assessments on the test day. Past evening behaviors are recorded retrospectively (e.g., via questionnaires or an interview), and in case of mobile technology use, objective data read-outs are obtained from the devices.

Whether or not it is important to monitor the drinking session depends entirely on the aim of the individual research project. For some studies, it may be essential that certain behaviors are strictly controlled. For example, when comparing the efficacy and safety of a hangover treatment with a placebo treatment, it is vital that both test days are comparable in terms of consumed alcohol and achieved hangover severity levels. In this case, a strictly controlled RCT design would be favorable. If one chooses to use a naturalistic study design in efficacy studies, the statistical analysis should account for differences between the test days (e.g., in the form of covariates or propensity scores). However, it is not always possible to accurately account for all variables. This could, for example, be because they depend on subjective self-reports (e.g., alcohol intake), because certain information is lacking (e.g., the congener content of drinks), or because a certain factor has not (yet) been recognized as relevant (e.g., a certain genotype or developmental factors). In summary, several important factors that differ between test days (e.g., certain behaviors) and that may bias the comparison between treatment and placebo will likely remain unknown or unrecognized and, therefore, not be properly accounted for.
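To make the covariate approach concrete, an ANCOVA-style regression models the outcome on the treatment indicator plus the measured between-day differences. The sketch below is a minimal, self-contained illustration in plain Python; the data, variable names, and effect sizes are entirely hypothetical and are not taken from any study cited here.

```python
def solve_linear_system(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting
    (sufficient for the small normal-equation system used below)."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x


def adjusted_treatment_effect(severity, treatment, drinks):
    """ANCOVA-style ordinary least squares:
    severity ~ intercept + drinks + treatment.
    Returns the treatment coefficient, i.e., the treatment-placebo
    difference adjusted for differences in reported alcohol intake."""
    X = [[1.0, float(d), float(t)] for d, t in zip(drinks, treatment)]
    p = len(X[0])
    XtX = [[sum(row[i] * row[j] for row in X) for j in range(p)] for i in range(p)]
    Xty = [sum(row[i] * y for row, y in zip(X, severity)) for i in range(p)]
    return solve_linear_system(XtX, Xty)[2]


# Hypothetical data: hangover severity scores, a treatment indicator
# (1 = active treatment day, 0 = placebo day), and drinks consumed.
drinks = [4, 6, 8, 10, 5, 7, 9, 11]
treatment = [0, 0, 0, 0, 1, 1, 1, 1]
severity = [6.0 + 0.5 * d - 1.5 * t for d, t in zip(drinks, treatment)]
effect = adjusted_treatment_effect(severity, treatment, drinks)
# Negative effect = treatment reduces severity after adjusting for intake.
```

In practice, one would use an established statistics package (e.g., statsmodels' `OLS` in Python or `lm()` in R) with multiple covariates; propensity-score methods follow the same logic of modeling, rather than designing away, the uncontrolled differences.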

On the other hand, if one is primarily interested in the effects of the (subjective) alcohol hangover itself on cognitive performance, mood, or other variables, then the behaviors that provoked the hangover state are of limited importance. In this case, there is no clear need to monitor the amount and type of alcohol consumed, the estimated peak BAC, or the setting and behaviors during the drinking session. In the extreme, participants could then be recruited in the morning after an evening out and allocated to a hangover or control group, or to groups that did or did not consume alcohol. This would be the ultimate way of not interfering with participant drinking behavior, as participants were unaware that they were going to participate in a research study at the time they displayed the study-relevant behavior (e.g., drinking or staying sober). This design was successfully applied by Devenney et al. [29], who recruited participants at university venues in the morning, i.e., on the day following the drinking session. However, if one is interested in how drinking variables and behaviors during the drinking session cause or relate to hangover variables, it is essential that these are accurately measured. Statistical analysis can then take into account the observed interindividual differences in naturalistic studies.

There are obvious advantages to applying a naturalistic study design in alcohol hangover research, as the drinking session reflects what people do in normal life. In contrast to RCTs, they are not forced to adapt to a drinking regime, including consuming alcoholic beverages that are not their regular choice during a pre-set drinking time period that may differ from a normal night out. In fact, research consistently shows that in real-life situations, most people consume much larger quantities of alcohol over a longer period of time than the pre-set dosages of alcohol administered in clinical studies to provoke a hangover. This results in significantly higher (and more realistic) BAC levels in naturalistic studies compared to many RCTs [30].
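Estimated BACs in naturalistic studies are typically back-calculated from self-reported consumption. One standard approach is the Widmark equation; the sketch below (Python) uses textbook approximations for drink size, distribution factor, and elimination rate, which are illustrative assumptions rather than values taken from the studies cited here.

```python
def widmark_bac(n_drinks, weight_kg, sex, hours_since_start=0.0,
                grams_per_drink=10.0):
    """Estimate BAC (%) from self-reported drinks via the Widmark equation.

    Assumptions (textbook approximations, for illustration only):
    - one standard drink contains ~10 g of pure alcohol,
    - Widmark distribution factor r ~ 0.68 (male) / 0.55 (female),
    - alcohol is eliminated at ~0.015 % per hour.
    """
    r = 0.68 if sex == "male" else 0.55
    grams = n_drinks * grams_per_drink
    peak_permille = grams / (weight_kg * r)  # grams per kg ~ per mille
    peak_percent = peak_permille / 10.0      # convert per mille to %
    return max(0.0, peak_percent - 0.015 * hours_since_start)


# Example: 12 self-reported drinks for an 80 kg male gives a peak
# estimate of ~0.22%, well above typical RCT dosing ceilings.
peak = widmark_bac(12, 80.0, "male")
```

Such back-calculations are rough, as they ignore, for example, drinking pace and food intake, which is one reason wearable transdermal alcohol sensors are attractive as a complement to self-report.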

Assessments during the hangover state can then take place in the clinic, following a highly standardized and controlled protocol, similar to RCTs. Alternatively, Scholey et al. [31] utilized online cognitive testing in a naturalistic hangover study and demonstrated that this was an effective way to collect objective data in real time during the hangover state. This study also addressed the issue of participant dropout. It has been argued that participants who experience severe events may not continue participation in naturalistic studies. This would of course bias the study outcome in favor of a treatment. Scholey et al. [31] compared participants who completed the study with those who dropped out. For both groups, peak BAC was assessed in real time on the evening before the (hangover) test day, and no significant BAC difference was observed between participants who did and did not complete the test day assessments. Hence, there are presumably reasons other than the mere degree of intoxication that determine whether participants discontinue study participation. A different approach has been the use of mobile technology, including screen-based tests, to enable participants to be assessed within the privacy and safety of their own homes, without the need to travel to the test center when hungover, thereby avoiding dropouts [32].

Finally, studies comprising alcohol administration to humans usually require ethics approval. Many ethics committees appear to make a noteworthy distinction based on whether the alcohol is actually administered to participants by the experimenters (RCTs) or whether participants administer it themselves in an unsupervised setting (naturalistic studies). Ethics committees often limit the amount of alcohol researchers are allowed to administer to RCT participants to a BAC below 0.12%, while in study protocols for naturalistic studies it is unknown how much alcohol participants will consume. Naturalistic studies consistently demonstrate that actual drinking levels are associated with much higher BACs. For example, Hogewoning et al. [30] reported an estimated BAC of 0.2%. When interviewed, naturalistic study participants attest that they had a 'normal' night out, including their usual drinking behavior. This is an odd situation considering that, in RCTs, alcohol consumption is closely monitored with a physician and study personnel present, while participants can drink alcohol freely and unsupervised in naturalistic studies. Monitoring the level of alcohol consumed will also aid in evaluating hangover treatments: due to alcohol dosing restrictions, true symptom levels may not be reached in the laboratory, so that effectively only 'sub-clinical' hangover symptom levels are evaluated.

Of note, the viewpoints and safety concerns of ethics committee members are not always in agreement with those of study participants. For example, Petrie et al. [33] investigated the stress and pressure/imposition experienced by RCT participants for a variety of study-related procedures (e.g., blood pressure assessment, blood drawing) and compared their ratings to those of ethics committee members. The study revealed that several commonly applied procedures, such as taking a saliva sample or completing a questionnaire or mood scale, were rated as significantly less stressful by RCT participants than anticipated by ethics board members.

Petrie et al. [33] also compared the experienced stress levels in RCT procedures with those experienced in daily life and found that many relatively harmless experiences (e.g., stress when 'asked to donate to a charity in the street' or being 'caught in the rain') were rated as more stressful by study participants than completing a mood scale or delivering a saliva or urine sample. The overall conclusion of the study was that study-related stress and the impact of procedures in the standardized data collection may be overestimated by some ethics committees. Unfortunately, the restrictions that ethics committees feel inclined to impose upon proposed research projects (especially RCTs) can have a significant impact on the ecological validity of these studies and the consequential validity of the findings.

#### **6. Concluding Remarks**

The commonalities and differences between RCT designs and naturalistic studies are summarized in Table 1.

**Table 1.** Commonalities and differences between randomized controlled trial (RCT) and naturalistic study designs.


Description of validity types: Ecological validity = the extent to which the study reflects a realistic hangover drinking occasion; external validity = the extent to which findings can be generalized to the population as a whole; internal validity = the extent to which the design can demonstrate causal effects; criterion validity = the extent to which measures are related to study outcomes; construct validity = the degree to which the administered tests measure what they claim or purport to measure. Abbreviation: blood alcohol concentration (BAC). Please note that this table is intended to contrast the RCT and naturalistic study designs. Some studies might incorporate features of both designs (e.g., supervised and standardized alcohol administration, but unsupervised sleep at home). Additionally, studies with the same design type may differ significantly in their levels of control, standardization, and quality.

RCT designs are preferred for studies that require strictly controlled study procedures. Treatment efficacy and safety studies, for example, require controlled treatment administration, and the variability in participants' behaviors (e.g., alcohol intake, physical activity, food intake, and sleep) should be kept to a minimum. However, RCTs, by definition, modify and structure participant behaviors in a standardized and, therefore, often "unnatural" way. Therefore, a naturalistic study design is preferred if one aims to better understand or mimic real-life interventions. The lack of standardization of naturalistic studies should, therefore, be considered a benefit of the study design.

Additionally, free drinking in naturalistic studies often exceeds the intoxication limits deemed safe and ethically approved for RCT studies, which further increases the ecological validity of naturalistic hangover studies compared to RCT hangover studies. To improve the level of control in naturalistic studies, mobile technology can be used to assess objective real-time data and control the quality of assessment, without investigators interfering with participant behaviors.

**Author Contributions:** Conceptualization: All authors; draft of first version of the manuscript: J.C.V. and G.B.; all authors approved the final version.

**Conflicts of Interest:** S.B. has received funding from Red Bull GmbH, Kemin Foods, Sanofi Aventis, Phoenix Pharmaceutical, and GlaxoSmithKline. A.S. has held research grants from Abbott Nutrition, Arla Foods, Bayer Healthcare, Cognis, Cyvex, GlaxoSmithKline, Naturex, Nestle, Martek, Masterfoods, and Wrigley and has acted as a consultant/expert advisor to Abbott Nutrition, Barilla, Bayer Healthcare, Danone, Flordis, GlaxoSmithKline Healthcare, Masterfoods, Martek, Novartis, Unilever, and Wrigley. Over the past three years, J.C.V. has received grants/research support from the Dutch Ministry of Infrastructure and the Environment, Janssen, Nutricia, and Sequential and acted as a consultant/advisor for Clinilabs, More Labs, Red Bull, Sen-Jam Pharmaceutical, Toast!, and ZBiotics. C.A. has undertaken sponsored research, or provided consultancy, for a number of companies and organizations, including Airbus Group Industries, Astra, British Aerospace/BAeSystems, Civil Aviation Authority, Duphar, FarmItalia Carlo Erba, Ford Motor Company, ICI, Innovate UK, Janssen, LERS Synthélabo, Lilly, Lorex/Searle, UK Ministry of Defense, Quest International, Red Bull GmbH, Rhone-Poulenc Rorer, and Sanofi Aventis. The other authors have no potential conflicts of interest to disclose.

#### **References**


© 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

### *Review* **The Assessment of Overall Hangover Severity**

#### **Joris C. Verster 1,2,3, Aurora J.A.E. van de Loo 1,2, Sarah Benson 3, Andrew Scholey <sup>3</sup> and Ann-Kathrin Stock 4,\***


Received: 25 February 2020; Accepted: 11 March 2020; Published: 13 March 2020

**Abstract:** The aim of this study was to critically evaluate and compare the different methods to assess overall hangover severity. Currently, there are three multi-item hangover scales that are commonly used for this purpose. All of them comprise a number of hangover symptoms for which an average score is calculated. These scales were compared to a single, 1-item scale assessing overall hangover severity. The results showed that the hangover symptom scales significantly underestimate (subjective) hangover severity, as assessed with a 1-item overall hangover severity scale. A possible reason for this could be that overall hangover severity varies, depending on the frequency of occurrence of individual symptoms included in the respective scale. In contrast, it can be assumed that, when completing a 1-item overall hangover scale, the rating includes all possible hangover symptoms and their impact on cognitive and physical functioning and mood, thus better reflecting the actually experienced hangover severity. On the other hand, solely relying on hangover symptom scales may yield false positives in subjects who report not having a hangover. When the average symptom score is greater than zero, this may lead to non-hungover subjects being categorized as having a hangover, as many of the somatic and psychological hangover symptoms may also be experienced without consuming alcohol (e.g., having a headache). Taken together, the current analyses suggest that a 1-item overall hangover score is superior to hangover symptom scales in accurately assessing overall hangover severity. We therefore recommend using a 1-item overall hangover rating as primary endpoint in future hangover studies that aim to assess overall hangover severity.

**Keywords:** alcohol; hangover; symptoms; severity; measurement; scale; single item assessment

#### **1. Introduction**

The alcohol hangover is defined as the combination of negative mental and physical symptoms which may be experienced the day after a single episode of alcohol consumption, starting when blood alcohol concentration (BAC) approaches zero [1,2]. Alcohol hangovers are typically characterized by a combination of symptoms affecting subjective mood, cognition and physical functioning [3–6]. These symptoms have been shown to negatively impact daily activities such as job performance [7] and driving [8–10]. The annual economic costs of alcohol hangover in terms of absenteeism and presenteeism have been estimated to be 173 billion USD for the USA [11] and 4 billion GBP for the UK [12]. Since the foundation of the Alcohol Hangover Research Group in 2010, the amount of research on the causes, consequences, treatment, and prevention of hangovers has been growing rapidly. Accurate measurement tools are essential to assess hangover severity, for example in experimental and naturalistic studies, in intervention studies examining treatment efficacy, and in survey research. Generally, they may be filled in by drinkers of all (adult) ages and in any drinking-related (research) context. There is, however, ongoing debate about which measure of hangover severity is most suitable. In this paper, we compare the three most widely used hangover symptom scales with a 1-item overall hangover severity rating, and discuss why the latter is a more reliable and useful measure to be included in future hangover research.

#### **2. Characteristics of an Effective Patient-Reported Outcome Measure (PROM)**

Currently, there are no biomarkers that accurately and objectively assess hangover severity. Therefore, one has to rely on Patient-Reported Outcome Measures (PROMs). A PRO instrument is often a questionnaire/scale or single item that is directly answered by the patient, capturing the patient's experience without interpretation of the patient's response by a clinician or anyone else [13]. To ensure the validity of a PRO, it is fundamental that it reliably measures the concept it is intended to measure (i.e., hangover). In the case of a multi-item scale, it is furthermore important that all of the individual items adequately contribute to the final conceptual framework of the instrument [13]. Any given scale can only be considered to have sufficient validity as a measuring tool when these conditions are fulfilled. The complex nature of alcohol hangover severity, which includes multidomain facets associated with the presence and severity of variable symptoms, and their impact on cognitive and physical functioning and mood, makes it quite challenging to develop multi-item scales that accurately assess the concept of hangover severity. Nevertheless, there are currently three hangover symptom scales available for this purpose, and they are commonly used in hangover research [14–16].

Slutske et al. [14] developed the Hangover Symptoms Scale (HSS) to assess the frequency with which drinkers experienced hangover symptoms in the last year. The scale consists of 13 items including "felt extremely thirsty or dehydrated", "felt more tired than usual", "experienced a headache", "felt very nauseous", "vomited", "felt very weak", "had difficulty concentrating", "more sensitive to light and sound than usual", "sweated more than usual", "had a lot of trouble sleeping", "was anxious", "felt depressed", and "experienced trembling or shaking". Items can be scored either dichotomously (experienced the symptom, or not), or on a 5-point scale from "never", "2 times or less" (once or twice per year), "3–11 times" (more than once or twice, but less than once per week), "12–51 times" (more than once a month, but not every week), and "52 times or more" (once per week or more frequently). It is important to underline that the original HSS outcome is a frequency measure. The scale has, however, been modified and used to assess hangover severity, for example by using the same items but changing the item scoring into a symptom rating ranging from 0 (absent) to 10 (extreme) [3].

Rohsenow et al. [15] developed the Acute Hangover Scale (AHS) to assess hangover severity. The scale consists of nine items including "hangover", "thirsty", "tired", "headache", "dizziness/faintness", "loss of appetite", "stomachache", "nausea", and "heart racing", which are rated on a scale ranging from 0 to 7. The anchors of the scale are "none" (score of 0), "mild" (score of 1), "moderate" (score of 4), and "incapacitating" (score of 7). Overall hangover severity is computed by calculating the average score across the nine AHS items.

The third scale, developed by Penning et al. [16], is the Alcohol Hangover Severity Scale (AHSS). It consists of 12 items, including "fatigue (being tired)", "clumsiness", "dizziness", "apathy", "sweating", "shivering", "confusion", "stomach pain", "nausea", "concentration problems", "heart pounding", and "thirst". Symptom severity for each item can be rated on a scale ranging from 0 (absent) to 10 (extreme). Overall hangover severity is the average score across the 12 items.
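The aggregate scoring shared by the AHS and AHSS (an unweighted mean across all item ratings) can be sketched as follows. This is an illustrative sketch only: the ratings are invented, and the `aggregate_score` helper is ours, not part of any published scale.

```python
# Illustrative sketch of aggregate scoring for multi-item hangover scales
# (AHS/AHSS style): overall severity = unweighted mean of all item ratings.
# Item names follow the AHS as described in the text; the example ratings
# are invented for illustration only.

AHS_ITEMS = ["hangover", "thirsty", "tired", "headache", "dizziness/faintness",
             "loss of appetite", "stomachache", "nausea", "heart racing"]

def aggregate_score(ratings, items):
    """Unweighted mean across all scale items; unreported items count as 0 (absent)."""
    return sum(ratings.get(item, 0.0) for item in items) / len(items)

# A drinker with only a few moderate symptoms receives a low aggregate score,
# even if the hangover feels considerably more severe overall.
ratings = {"tired": 6, "headache": 5, "thirsty": 4}
print(round(aggregate_score(ratings, AHS_ITEMS), 2))  # 1.67
```

Because every absent item pulls the mean towards zero, a scale containing many rarely experienced symptoms will systematically deflate the aggregate score, which is the bias discussed in the following sections.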

The three hangover symptom scales each present with some shortcomings, and their limitations are discussed in detail elsewhere [16]. For example, the scales do not always include true hangover symptoms (e.g., the HSS includes the item "trouble sleeping", which is experienced before the start of the hangover state). The AHS includes the item "hangover", but it is usually not advised to include an item that is identical to the overall concept that one aims to measure with a multi-item scale. Finally, the AHSS does not include the item "headache", even though this is a frequently reported hangover symptom. Notwithstanding these limitations, the three hangover symptom scales are currently the most frequently used scales to assess hangover severity. Other researchers, such as Hogewoning et al. [17], have used an extended symptom listing, including all symptoms of the three hangover symptom scales.

Alternatively, overall hangover severity may be assessed with a single item rated on a scale ranging from absent (0) to extreme (10). This 1-item score is hypothesized to encompass all symptoms experienced by the drinker, including their relative impact on daily activities and mood.

#### **3. Comparing Overall Hangover Severity Outcomes of the Different Assessment Methods**

When using multi-item instruments, it is important that all items are relevant to the concept under investigation (i.e., alcohol hangover). If irrelevant items were included (e.g., symptoms that are seldom reported), this would result in an overall hangover severity score that is biased towards zero, and would therefore tend to underestimate, or even fail to show, the effects of treatment in intervention studies. In extreme cases, this might even lead to the wrong conclusion that a treatment is ineffective [13].

Penning et al. [3] examined the scientific literature and identified 47 hangover symptoms. However, the AHSS, HSS and AHS each comprise a different selection of these symptoms. This selection of symptoms may have a significant and potentially biasing effect on the aggregate rating of overall hangover severity. The discrepancy between aggregate symptom scores and 1-item overall hangover severity assessments has been demonstrated previously by Penning et al. [16]. In that study, 947 subjects (mean (SD) age of 21.1 (2.3) years old, 46% men) rated the presence and severity of 23 hangover symptoms on a scale ranging from 0 (absent) to 10 (extreme). In addition, overall hangover severity was assessed. Further evaluation of the data showed that mean (SD) scores on a modified HSS and the AHSS were 3.6 (1.4) and 3.7 (1.7), respectively. Of note, the mean (SD) severity score on the 1-item hangover scale, i.e., 5.7 (2.2), was significantly higher (*p* < 0.0001) than both the modified HSS and the AHSS hangover score. These observations suggest that hangover severity scores based on aggregate symptom scores significantly underestimate the true hangover severity assessed with a single overall severity item.

There are three important reasons why aggregate symptom scores may deviate from the true hangover effect. These are related to (1) the relative presence and severity of hangover symptoms, (2) the impact of the experienced symptoms on cognitive functioning, physical activities, and mood, and (3) the fact that several assessed symptoms are also experienced without having a hangover, or even without consuming alcohol at all. These issues are discussed further in the next sections.

#### **4. Presence and Severity of Hangover Symptoms**

The occurrence and severity of symptoms experienced in the hangover state may differ significantly between symptoms. This is illustrated by evaluating the data by Van Schrojenstein Lantman et al. [4], which are depicted in Figure 1. This study surveyed *n* = 1837 social drinkers who reported overall hangover severity and the presence and severity of individual hangover symptoms experienced in their last hangover in the past month, and rated this on a scale ranging from 0 (absent) to 10 (extreme). On this occasion, they reported consuming a mean (SD) of 12.6 (5.5) alcoholic drinks, corresponding to an estimated peak BAC of 0.19 (0.1)%. Figure 1 shows that both the frequency of occurrence and severity differed considerably between individual hangover symptoms. Most individual symptom scores are lower than the 1-item overall hangover severity score, suggesting that a symptom average score will underestimate overall hangover severity. As the three hangover scales comprise different hangover symptoms, it is understandable that the aggregate symptom scores of these scales differ from each other. Several symptoms, such as depression and anxiety (both low frequency/low severity), have a limited contribution to the aggregate scale score, whereas other symptoms, such as concentration problems and being tired (both high frequency/high severity symptoms), have a large contribution to the aggregate score. Including low frequency/low severity symptoms or excluding high frequency/high severity symptoms results in an underestimation of the "true" overall hangover severity. It is evident from Figure 1 that the HSS, especially, contains several low frequency/low presence items. Therefore, HSS scores likely underestimate the true hangover severity to a greater extent than the AHS and AHSS.

**Figure 1.** Presence and severity of symptoms included in hangover symptom scales. Data from *n* = 1837 social drinkers who reported on their latest past month hangover [4]. Note: "sensitivity to sound" was not assessed. Abbreviations: HSS = Hangover Symptoms Scale, AHSS = Alcohol Hangover Severity Scale, AHS = Acute Hangover Scale.

A similar variability in the presence and severity of hangover symptoms was recently reported by Van Lawick van Pabst et al. [18]. Omitting relevant items from a scale can have a significant impact on the overall rating of hangover severity. An example from the dataset of Van Schrojenstein Lantman et al. [4] is the item "sleepiness", which is not included in any of the three hangover scales. Sleepiness was reported by 97.1% of participants and its severity was rated as 6.5 out of 10 (extreme). It can be assumed that, when completing a 1-item overall hangover severity scale, the subject's rating is influenced by all symptoms and feelings the subject experiences during the hangover state. Therefore, aggregate scale scores of a limited number of symptoms are very likely to underestimate the true overall hangover severity.

#### **5. Negative Impact of Hangover Symptoms**

When judging overall hangover severity, it is likely that drinkers will take into account to what extent all experienced individual hangover symptoms negatively affect their cognitive functioning, physical activities, and mood. Symptoms with the largest negative impact on these domains are not necessarily those symptoms that have the highest severity scores. There is also no relationship between the impact symptoms may have and their relative frequency of occurrence in the overall drinking population. For example, heart racing can be a very disturbing effect and have a significant impact on mood. However, the symptom is not frequently reported. Conversely, presence and severity ratings for being thirsty are usually high, while effects on cognitive functioning, physical activities, and mood are virtually absent.

Van Schrojenstein Lantman et al. [4] also examined the impact of experiencing hangover symptoms on cognitive and physical functioning, and mood in *n* = 1837 social drinkers who reported on their last hangover experience in the past month. Negative impact of hangover symptoms on cognitive and physical functioning, and mood was rated on scales ranging from 0 (absent) to 5 (extreme). The results are summarized in Figure 2.

**Figure 2.** Impact of hangover symptoms on cognitive and physical functioning and mood. Negative impact of hangover symptoms on cognitive functioning (**A**), physical functioning (**B**), and mood (**C**) was rated on scales ranging from 0 (absent) to 5 (extreme). Note: "sensitivity to sound" was not assessed and "trouble sleeping" was excluded as not being a true hangover symptom. Data from reference [4].

It is evident from Figure 2 that there are clear differences regarding the extent to which hangover symptoms have an impact on cognition, physical activities, and mood. Therefore, the specific items that are included in a hangover symptom scale determine to what degree the true overall hangover effect is accurately reflected in an aggregate score (especially if no item weights are used during the formation of the composite score). As the hangover symptom scales do not include all imaginable hangover symptoms, while at the same time providing items that may not apply to a given participant, they will likely underestimate the "true" overall impact of hangover symptoms on cognition, physical activities, and mood. This can again be illustrated by considering the hangover symptom "sleepiness". Although not included in any of the three hangover symptom scales, Van Schrojenstein Lantman et al. [4] found that sleepiness was reported by 97.1% of drinkers. The mean (SD) impact scores for sleepiness were 2.7 (1.7) for cognitive functioning, 2.5 (1.7) for physical functioning, and 2.4 (1.6) for mood. If this symptom were incorporated in a scale, it would very likely have influenced the aggregate impact score.

#### **6. Symptoms May Also Be Present without a Hangover or Alcohol Consumption**

Hangover symptoms are also experienced when no alcohol is consumed. As a result, aggregate symptom scores may be greater than zero, even when no alcohol has been consumed. A recent study [19] compared hangover symptoms between subjects with and without a hangover and demonstrated that several symptoms are not unique to the hangover state but are also present without having a hangover or consuming alcohol. In this study, *n* = 299 subjects who were on holiday in Greece (mean (SD) age of 38.9 (11.0) years old) completed the AHS in the morning before walking the Samaria Gorge. *n* = 47 subjects consumed alcohol the evening before but reported having no hangover, *n* = 176 consumed alcohol and reported a hangover, and *n* = 76 consumed no alcohol and reported no hangover. Reported hangover symptoms from the three groups are depicted in Figure 3.

**Figure 3.** Presence and severity of symptoms related to alcohol hangover. Note: In contrast to the original AHS, scores range from 0 (absent) to 10 (extreme). Data from reference [19].

First of all, these data again demonstrate that the mean (SD) AHS score of 2.9 (1.3) significantly underestimated the true hangover severity of 4.6 (2.1) among subjects with a hangover (*p* < 0.0001). Statistical comparisons of the individual hangover symptom scores between subjects who consumed no alcohol, those who consumed alcohol but reported no hangover, and drinkers with a hangover revealed no significant differences between the groups for nausea and loss of appetite. Severity scores for headache did not significantly differ between drinkers with and without a hangover. The data in Figure 3 suggest that most symptoms that are attributed to the hangover state are always present, irrespective of alcohol consumption or having a hangover. It can be assumed that, when rating overall hangover severity via a 1-item scale, drinkers take into account that some symptoms may already be present on non-drinking days as well. For example, they may usually feel somewhat tired (although perhaps to a lesser extent than during the hangover state). This knowledge is then incorporated in their rating of hangover severity, which is more likely to reflect the difference/changes in symptom severity relative to a normal non-drinking day. Although the latter cannot be proven with the data at hand, we deem it to be a plausible hypothesis. Yet, hangover scales aggregate symptom scores without taking baseline symptom scores into account. As a result, a positive AHS hangover severity score was obtained both in subjects who reported having no hangover after drinking alcohol (0.9) and in subjects who did not consume alcohol at all (1.0), even though their 1-item overall hangover severity score was zero. When relying solely on AHS scores, it would incorrectly be assumed that these subjects had a hangover.
In this study, 95% of subjects who reported no hangover via the 1-item overall hangover severity rating had an AHS score greater than zero and would be incorrectly labelled as having a hangover. When these subjects are included in the dataset for statistical analysis, their AHS scores are higher than those assessed with the 1-item severity score (i.e., zero), meaning that the AHS score overestimates the true overall hangover severity and produces false positives. Taken together with the finding that the severity of a true hangover tends to be underestimated (due to the fact that not all of the items usually apply), while severity tends to be overestimated in the absence of a hangover, composite scores may be worse than 1-item overall ratings in differentiating between individuals with severe versus light hangover symptoms, as they produce a tendency towards the middle. While it may theoretically be possible to identify false positives by the ratio of scale scores to overall ratings (even though this remains to be tested), it would likely be impractical in most cases to have participants fill in an entire questionnaire when a single-item overall hangover rating already provides the required information equally well, if not better.
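The false-positive problem described above can be made concrete with a small sketch. The three subjects and the classification rule "aggregate score > 0 implies hangover" are illustrative assumptions; only the group means (0.9, 1.0, and 2.9 versus 4.6) come from the data discussed in the text.

```python
# Sketch of the false-positive problem: inferring "hangover present" from an
# aggregate symptom score versus from a 1-item overall rating. The subject
# tuples are illustrative; the classification rule is an assumption.

subjects = [
    # (aggregate AHS-style score, 1-item overall hangover rating)
    (0.9, 0.0),  # drank alcohol, reported no hangover
    (1.0, 0.0),  # consumed no alcohol, reported no hangover
    (2.9, 4.6),  # reported a genuine hangover
]

# Count subjects whom the aggregate score would mislabel as hungover:
false_positives = sum(1 for aggregate, overall in subjects
                      if aggregate > 0 and overall == 0)
print(false_positives)  # 2: both non-hungover subjects are mislabeled
```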

#### **7. Day to Day Variability in the Presence and Severity of Hangover Symptoms**

Van Wijk et al. [20] examined hangover severity of *n* = 22 students who were on a skiing holiday in Italy. The students experienced multiple hangovers during this period. Each morning at breakfast, subjects completed a modified AHS. The AHS included all nine symptoms, including a 1-item overall hangover severity score, but the severity scoring of items was modified to a range from 0 (absent) to 10 (extreme). In addition, past evening alcohol consumption was recorded and the level of subjective intoxication (i.e., drunkenness) was rated on a scale from 0 (absent) to 10 (extreme). For *n* = 13 subjects, it was possible to match two test days with identical hangover scores, as assessed with the 1-item overall hangover severity score. Several important observations were made when evaluating the data. First of all, the 1-item overall hangover severity score was different between subjects, but identical on the two test days (see Figure 4A). However, the AHS scores of the two test days showed considerable variability for some of the subjects (see Figure 4B), as the severity scores of individual hangover symptoms contributing to the aggregate AHS score differed between the two test days (see Figure 4C–J). In other words, despite having identical 1-item overall hangover severity scores, subjects reported considerable variability in individual symptom scores and overall AHS scores on the two test days. Finally, the data showed that having an identical 1-item overall hangover severity score does not necessarily imply that the same amount of alcohol was consumed, or that the corresponding level of reported intoxication was similar on the evening preceding the test day. Instead, subjects consumed different amounts of alcohol on both test days (see Figure 4K) and reported different levels of subjective intoxication (see Figure 4L). Notwithstanding this, their 1-item overall hangover severity scores on each test day were identical.

**Figure 4.** Level of subjective intoxication and alcohol consumption and corresponding next day hangover symptom severity reported on two different test days by the same subjects. Individual subject ratings for 1-item overall hangover severity (**A**), the AHS score (**B**), individual symptom scores (**C–J**), the amount of alcohol consumed the evening before having the hangover (**K**), and the corresponding level of subjective intoxication (**L**) are shown.

Figure 5 shows the test–retest reliability of the AHS and its items. While the test–retest reliability of the 1-item overall hangover severity score was 1.0 (maximal, as test days had been selected to fulfil this criterion), the AHS test–retest reliability (0.69) was below the generally acceptable level of test–retest reliability of 0.7 [21]. The variability in individual hangover severity scores was greatest for headache, thirst, and nausea, and, except for dizziness, none of the symptoms reached the acceptable limit of 0.7 for test–retest reliability. Applying the more stringent Bland-Altman 95% limits of agreement method [22]—in which 95% of difference scores of day 1 and day 2 item or scale ratings should lie within the range of two standard deviations of the mean difference score to demonstrate agreement between the two assessments—revealed that no agreement was found for the symptoms of headache, heart racing, and loss of appetite.

**Figure 5.** Test–retest reliability. Spearman's correlations are shown. Higher scores suggest a better test–retest reliability. Bootstrapping (*n* = 10,000 samples, bias-corrected 95% confidence interval) was applied to adjust correlations for the small sample size. An acceptable test–retest reliability is demonstrated if Spearman's correlation > 0.7 [21].
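The Bland-Altman criterion as described above (at least 95% of the day 1 minus day 2 difference scores within two standard deviations of the mean difference) can be sketched as follows; the paired ratings are hypothetical, not data from the study.

```python
import statistics

def bland_altman_agreement(day1, day2):
    """Agreement per the criterion described in the text: at least 95% of the
    day1 - day2 difference scores lie within mean(diff) +/- 2 * SD(diff)."""
    diffs = [a - b for a, b in zip(day1, day2)]
    mean_d = statistics.mean(diffs)
    sd_d = statistics.stdev(diffs)
    lower, upper = mean_d - 2 * sd_d, mean_d + 2 * sd_d
    within = sum(1 for d in diffs if lower <= d <= upper)
    return within / len(diffs) >= 0.95

# Hypothetical 0-10 symptom ratings from the same subjects on two matched days.
day1 = [5, 6, 4, 7, 5, 6, 3, 8, 5, 4, 6, 5, 7]
day2 = [5, 5, 4, 6, 6, 6, 3, 7, 5, 4, 5, 5, 7]
print(bland_altman_agreement(day1, day2))  # False: one difference falls outside
```

Note that with only 13 pairs, a single difference outside the limits already drops the within-limits fraction to 12/13 (about 92%), below the 95% criterion, which is why this method is stricter than a correlation threshold in small samples.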

An alternative way to look at the data is to select test days on which subjects consumed an identical amount of alcohol and then compare the presence and severity of AHS symptom scores. For *n* = 18 subjects, it was possible to match two test days with an identical amount of alcohol consumption. They consumed a mean (SD) of 11.6 (5.7) alcoholic drinks on these test days (range: 2 to 20 alcoholic drinks). Their AHS scores and individual symptom scores are summarized in Table 1 and Figure 6. Despite the fact that subjects consumed the same amount of alcohol on both test days, the data show considerable variability within subjects in both the presence and severity ratings on individual hangover symptoms, including the 1-item hangover severity score.



**Table 1.** AHS and symptom severity scores for two days on which an equal amount of alcohol was consumed by subjects.

**Figure 6.** AHS and symptom severity scores for two days on which an equal amount of alcohol was consumed within subjects. Significant differences (*p* < 0.05) between the two days are indicated by an asterisk (\*). Abbreviation: AHS = acute hangover scale.

Test–retest reliability for the AHS (r = 0.731, *p* = 0.001) was acceptable. With regard to individual symptoms, an acceptable test–retest reliability was, however, only found for the symptom of being tired (r = 0.775, *p* < 0.0001). No acceptable test–retest reliability was found for the ratings of overall hangover (r = 0.537, *p* = 0.022) and stomach pain (r = 0.569, *p* = 0.014). A poor test–retest reliability was found for being thirsty (r = 0.365, *p* = 0.136), dizziness (r = 0.351, *p* = 0.153), heart racing (r = 0.240, *p* = 0.337), headache (r = 0.186, *p* = 0.460), and nausea (r = 0.090, *p* = 0.723). The low test–retest reliabilities again confirm that the presence and severity of hangover symptoms vary considerably between drinking occasions. In line with this, a recent study showed that there is great intraindividual variability in hangover severity scores between drinking occasions, even when the same amount of alcohol is consumed [23], and regression analyses demonstrated that the amount of consumed alcohol is usually not a strong predictor of hangover severity [24].
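Test–retest correlations of this kind can be computed with Spearman's rank correlation, sketched here from scratch on hypothetical day 1/day 2 ratings (the 0.7 acceptability criterion follows reference [21]; the data are invented and the helper functions are ours):

```python
def rank(values):
    """1-based average ranks, with ties receiving the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the tied positions, 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman's rho = Pearson correlation of the rank-transformed data."""
    rx, ry = rank(x), rank(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

# Hypothetical symptom ratings by the same subjects on two matched test days.
day1 = [2, 5, 7, 4, 6, 3, 8]
day2 = [3, 5, 8, 4, 7, 2, 9]
r = spearman(day1, day2)
print(round(r, 3), r > 0.7)  # 0.964 True: acceptable per the 0.7 criterion
```

In practice one would use `scipy.stats.spearmanr`; the from-scratch version is shown only to make the rank-correlation logic explicit.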

Table 1 further shows that the AHS scores on day 1 and day 2 are greater than zero for each subject. This could suggest that all of them experienced a hangover on both days. However, the 1-item overall hangover severity score demonstrated this to be incorrect, as six out of 18 subjects (33.3%) did not report having a hangover on day 1, and eleven subjects (61.1%) reported having no hangover on day 2. Taken together, relying solely on hangover symptom scales to assess the presence and severity of alcohol hangover will likely result in inaccurate results.

#### **8. Should We Abandon the Use of Hangover Symptom Scales?**

The fact that the outcomes of composite hangover scales do not appear to accurately reflect overall hangover severity does not imply that we should abandon their use altogether. In many cases, it is very relevant to assess the presence and severity of individual hangover symptoms. For example, if a company claims that treatment X is effective in reducing hangover headaches, it is highly relevant to assess "headache" severity, in addition to an assessment of overall hangover severity. As is evident from Section 4, there is great variability in the presence and severity of individual hangover symptoms. It is also important to identify those symptoms that are most bothersome and impairing to subjects. For future research, it is therefore recommended to also assess individual hangover symptoms. This can be done by using one of the existing hangover scales, or simply by assessing individual symptoms of interest via symptom-specific 1-item severity scores. However, the judgement of the overall efficacy of a hangover treatment should preferably be based on a 1-item overall hangover severity rating, as this most likely incorporates all experienced symptoms and circumstances of the hangover state. As discussed in Section 6, symptoms experienced during a hangover may also be experienced on non-hangover days. Therefore, it is advisable to use difference scores for these individual symptoms when comparing their severity on a hangover day versus a no-hangover day to capture the "true" hangover effect. The latter does not apply to intervention studies, where a direct comparison of symptom scores between treatment and placebo should be made to evaluate a possible difference between the two hangover conditions.

Finally, it should be acknowledged that subjective ratings have sometimes been argued to be inherently unreliable, thus mandating the assessment of biomarkers to obtain an objective measure of hangover severity. Indeed, it would be very useful if such a biomarker were discovered. Biomarkers related to alcohol metabolism and immune function may be promising candidates, but several lines of research have unfortunately not yet identified a suitable biomarker. Biomarkers can be assessed in various samples, including blood, saliva, hair, sweat, and stool, but, for practical use, volatiles in expired air would be ideal. However, traditional breathalyzer readings reflecting ethanol concentrations are not useful, as BAC readings are often zero in the hangover state [25]. Instead, chemical compounds other than ethanol, which should ideally relate to hangover severity and/or functional impairments, would need to be detected by a breathalyzer to reliably indicate alcohol hangover. Notwithstanding this, the alcohol hangover is a complex state with many symptoms that may be experienced alone or in combination and may differ in severity and impact on daytime functioning. Given the currently available knowledge on the potential underlying mechanisms, it is still unclear whether a complex concept such as alcohol hangover can be accurately represented by a single biomarker. Further research investigating the suitability of potential volatile biomarkers in order to develop a breathalyzer for the hangover state is currently in progress.

#### **9. Conclusions**

Hangover symptoms can vary in presence and severity between different drinking events in the same individual, and not all symptoms have equal impact in terms of being impairing or bothersome, regardless of their severity. Therefore, researchers should not rely solely on hangover symptom scales to assess overall hangover severity. Based on our reasoning, it is evident that composite, multi-item hangover symptom scales will likely underestimate severe hangovers and, at the same time, overestimate light hangovers, thus partly masking the true hangover effect. The resulting reduction in variance could hamper the ability to detect changes across study appointments or interventions and may thus have serious implications for the interpretation of study outcomes. Furthermore, the use of hangover symptom scales may also contribute to false positives, i.e., misidentifying subjects without a hangover as supposedly suffering from one. We propose that the most important, clinically meaningful endpoint for rating hangovers, and for evaluating the prevention or mitigation of hangovers by an effective product, is the 1-item overall hangover severity score. This measurement allows subjects to assess the effect the condition is having on them, taking into account which symptoms they are experiencing, how severe those symptoms are and, most importantly, how the condition affects their activities of daily living and their interactions with others, regardless of which individual symptoms comprise their condition at that moment. The single greatest strength of the 1-item global assessment as a primary outcome measure is that it incorporates the subject's evaluation of the impact of the specific subset of symptoms experienced at that time, with greater subject-focused information value than a symptom-based sum score can provide.
Thus, this single-score approach evaluates the entire constellation of the hangover state, in terms of presence, severity, and impact, regardless of the individual components contributing to it.

Thus, the 1-item overall hangover severity rating represents a self-reported outcome instrument capable of measuring the severity of a condition (i.e., hangover) or the effect of a treatment in concordance with, and incorporating, all three concepts of an effective Patient-Reported Outcome Measure, namely assessing the presence of symptoms, their effects on function, and their severity [13]. In addition, as secondary outcome measures (of efficacy), individual symptom presence and severity (or their impact) can be assessed either with hangover symptom scales or via symptom-specific ratings. This will identify which individual symptoms subjects describe as the most bothersome during the alcohol hangover state.

**Author Contributions:** Conceptualization, J.C.V., A.J.A.E.v.d.L., S.B., A.S. and A.-K.S.; formal analysis, J.C.V.; writing—original draft preparation, J.C.V.; writing—review and editing, J.C.V., A.J.A.E.v.d.L., S.B., A.S. and A.-K.S. All authors have read and agreed to the published version of the manuscript.

**Funding:** Open Access Funding by the Publication Fund of the TU Dresden.

**Conflicts of Interest:** S.B. has received funding from Red Bull GmbH, Kemin Foods, Sanofi Aventis, Phoenix Pharmaceutical and GlaxoSmithKline. Over the past 36 months, A.S. has held research grants from Abbott Nutrition, Arla Foods, Bayer, BioRevive, DuPont, Fonterra, Kemin Foods, Nestlé, Nutricia-Danone, and Verdure Sciences. He has acted as a consultant/expert advisor to Bayer, Danone, Naturex, Nestlé, Pfizer, Sanofi, and Sen-Jam Pharmaceutical, and has received travel/hospitality/speaker fees from Bayer, Sanofi and Verdure Sciences. Over the past 36 months, J.C.V. has held grants from the Dutch Ministry of Infrastructure and the Environment, Janssen, Nutricia, and Sequential, and acted as a consultant/expert advisor to Clinilabs, Morelabs, Red Bull, Sen-Jam Pharmaceutical, Toast!, and ZBiotics. A.-K.S. has received funding from the Daimler and Benz Foundation. A.J.A.E.v.d.L. has no conflicts of interest to declare.

#### **References**


© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
