#### 3.4.2. Individual Endpoints

Among the parameters studied at the individual level, 56% were non-behavioral parameters.

The most frequent non-behavioral endpoint (Figure S5b) was the mortality rate (73%), which was evaluated with different techniques. Histology techniques (9%) were used to study tissue damage, in particular the ultrastructure of the workers' hypopharyngeal glands. The impact of stressors on the development, immunity or reproduction of honey bees was also assessed.

The behavioral trials (Figure S5c) mainly used the proboscis extension reflex (PER), either within conditioning protocols (21%) or without conditioning (5%). Observation cages or hives were also used extensively (18% and 6%, respectively), sometimes in association with cameras (video-tracking). The cages were cardboard boxes, Petri dishes or other devices. These devices, as well as other systems often designed by the authors and used to study phototaxis, were used to evaluate abnormal behavior, locomotion, dance and activities in the hive as well as social interactions.

Foraging is an important behavioral parameter, mainly studied by counting foragers in the field or by recording the number of bees entering and/or exiting the hive. Other methods were identified but rarely used, such as pollen traps or weighing of foragers. Marking individuals with color marks or electronic tags (harmonic radar or radio frequency identification) was used in 11% of cases. Marking was often combined with artificial feeders to study the flight parameters of foragers, or with releasing honey bees at a distance from the hive to test their ability to return.

This inventory revealed a great number of techniques used to study a multitude of parameters. This great diversity of methods may be related to the fact that at present there are only five standard procedures to test chemicals on bees: two acute toxicity tests by ingestion or by contact with adults [36,37], a chronic oral toxicity test with adults [38] and two larval intake toxicity tests [39,40]. A test to evaluate homing success [41] is currently being ring-tested for validation, and has not yet been fully accredited. Various tests have been listed [42], but standard tests are still under development and standardization efforts need to be continued.

The diversity of the tests was also related to the large number of parameters that can be evaluated. It is therefore important to identify the most relevant parameters for assessing bee health.

#### *3.5. Did the Evidence of an Impact Vary According to the Type of Study or to the Scale of Study?*

In this part of the analysis, we examined the impact of the stressor on the endpoint studied (e.g., workers' mortality rate, expression of detoxification genes, immune enzyme activity, brood capping rate). The objective was to determine whether the evidence of an impact varied according to the type of study or the scale of study. The impact on a parameter could be "positive" or "negative", but this direction was not recorded.

#### 3.5.1. Type of Study

We investigated whether the parameters studied in the articles were differently affected by a stressor depending on whether the study was carried out in the field (38%) or in a laboratory (48%). In the field, the difference between the number of impacted and non-impacted parameters was very small (Figure 6) yet statistically significant (*p* = 0.04, Chi-square test). In laboratory studies, two-thirds of the parameters were impacted by the stressor and one-third were not. This difference could be explained by the effect of stress exposure, by a dose effect (the doses tested in a laboratory may be higher than the doses to which bees are exposed in the field) or by other effects such as co-exposure to multiple stressors and interactions between different products, which are difficult to control in field experiments.

**Figure 6.** Number of parameters impacted or not by the stressor studied according to the type of experimentation (*n* = 1532). (\* *p* < 0.05; \*\* *p* < 0.01; \*\*\* *p* < 0.001, Chi-square test). "Field studies" refer to studies in which the treatment was performed outside, in hives placed in fields or directly in the fields; "laboratory studies" refer to studies in which the treatment was conducted in the laboratory.
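As an illustration of how such a comparison can be computed, the sketch below assumes that the Chi-square test compares the numbers of impacted and non-impacted parameters against an equal split (a goodness-of-fit test with one degree of freedom). That reading is an assumption, since the text does not specify the null hypothesis, and the counts used are hypothetical rather than taken from Figure 6. The function names `chi_square_equal_split` and `stars` are illustrative only.

```python
import math


def chi_square_equal_split(impacted, not_impacted):
    """Chi-square goodness-of-fit test of two counts against a 50/50 split.

    Assumption: the null hypothesis is that impacted and non-impacted
    parameters are equally frequent (1 degree of freedom).
    """
    total = impacted + not_impacted
    if total == 0:
        raise ValueError("at least one parameter is required")
    expected = total / 2
    statistic = (
        (impacted - expected) ** 2 + (not_impacted - expected) ** 2
    ) / expected
    # With 1 degree of freedom, P(chi2 > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value


def stars(p_value):
    """Map a p-value to the star notation used in the figure captions."""
    if p_value < 0.001:
        return "***"
    if p_value < 0.01:
        return "**"
    if p_value < 0.05:
        return "*"
    return "NS"


# Hypothetical counts, NOT data from the reviewed articles.
statistic, p_value = chi_square_equal_split(impacted=60, not_impacted=40)
print(f"chi2 = {statistic:.2f}, p = {p_value:.4f}, {stars(p_value)}")
```

With large samples, even a small imbalance between the two counts can fall below the 0.05 threshold, which is one way a "very small yet statistically significant" difference can arise.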

#### 3.5.2. Scale of Study

When comparing the results obtained at different scales of study, the difference between the number of impacted and non-impacted parameters was not significant at the colony scale (Figure 7, *p* = 0.607, Chi-square test). At all other scales, significantly more parameters were impacted by a stressor than not impacted. This result suggests a buffering effect of the colony, which compensates for individual effects.

**Figure 7.** Number of parameters impacted or not by the stressor studied according to the scale of study (*n* = 1805). (\* *p* < 0.05; \*\* *p* < 0.01; \*\*\* *p* < 0.001, NS = not significant, Chi-square test).

Therefore, a similar pattern was observed for parameters studied in the field and parameters studied at the colony level: the difference between impacted and non-impacted parameters was very small in the field and not significant at the colony level. However, colony endpoint measurements were the main parameters studied in field tests. We therefore could not determine whether this attenuation of the observed impact of stressors was linked to the type of experimentation or to the buffering effect of the colony. It is very likely that both were involved, and other confounding factors may also contribute. Nevertheless, these two effects represent important methodological and biological barriers. Indeed, under natural conditions, it is very difficult to control bees' exposure to stressors and their interactions [9]. In addition, conducting robust studies requires a very large number of replicates, which may pose methodological problems in the field [43]. Mathematical modeling methods might circumvent these technical difficulties.

The buffering effect of the colony is very difficult to take into account in experiments and risk assessment. Indeed, when the individuals of a colony are affected by a stressor and the colony sets up measures to compensate for these individual effects, the impact of the stressor will not be detected even though the colony is suffering. For example, Henry et al. [12] have shown that when a colony loses an abnormally large number of foragers, it changes the way the reproductive effort is allocated between worker brood and drone brood: production of males is delayed, while the production of workers is strengthened. The colony size is then maintained, as well as the honey production. Thus, the colony appears to be in good health even though its foragers are disappearing, and the delay in male production may be problematic for mating. This also raises the question of the time scale of an experiment: how long can a colony compensate for a stress without visibly suffering? Are the tests long enough to observe deleterious effects on colonies? It is essential to set up techniques that address these issues. Radio frequency identification (RFID) chips are a first solution since they enable real-time observation of foragers' disappearance.
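To make the RFID idea concrete, the following minimal sketch flags tagged foragers that have not been detected at the hive entrance for longer than a chosen period. Everything in it is an assumption made for illustration: the log format (pairs of tag identifier and timestamp), the function name `missing_foragers`, the two-day threshold and the example reads are hypothetical and do not come from the studies reviewed here.

```python
from datetime import datetime, timedelta


def missing_foragers(detections, now, max_absence=timedelta(days=2)):
    """Return the sorted tag IDs whose latest read is older than max_absence.

    detections: iterable of (tag_id, timestamp) pairs recorded by a
    hypothetical RFID reader at the hive entrance.
    """
    last_seen = {}
    for tag_id, timestamp in detections:
        # Keep only the most recent read for each tag.
        if tag_id not in last_seen or timestamp > last_seen[tag_id]:
            last_seen[tag_id] = timestamp
    return sorted(
        tag_id for tag_id, seen in last_seen.items() if now - seen > max_absence
    )


# Hypothetical reads, NOT real data.
detections = [
    ("bee-001", datetime(2020, 6, 1, 9, 0)),
    ("bee-002", datetime(2020, 6, 1, 9, 5)),
    ("bee-001", datetime(2020, 6, 4, 14, 30)),
]
now = datetime(2020, 6, 5, 8, 0)
print(missing_foragers(detections, now))
```

A threshold-based rule like this is only one possible choice; the appropriate absence period would depend on the biology and on the reliability of the readers.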
