3.1. Hardware and Data Acquisition
The video data for this investigation were captured by BeePi monitors, multi-sensor EBM systems we designed and built in 2014 [8] and have been iteratively modifying since then [7,22]. Each BeePi monitor (see Figure 1) consists of a Raspberry Pi 3 Model B v1.2 computer, a Pi T-Cobbler, a breadboard, a waterproof DS18B20 temperature sensor, a Raspberry Pi Camera v2 8-megapixel board, a ChronoDot v2.1 real-time clock, and a Neewer 3.5 mm mini lapel microphone placed above the landing pad. All hardware components fit in a single Langstroth super. BeePi units are powered either from the grid or from rechargeable batteries.
BeePi monitors thus far have had six field deployments. The first deployment was in Logan, UT (September 2014), when a single BeePi monitor was placed into an empty hive and ran on solar power for two weeks. The second deployment was in Garland, UT (December 2014–January 2015), when a BeePi monitor was placed in a hive with overwintering honeybees and successfully operated for nine out of the fourteen days of deployment on solar power to capture ≈200 MB of data. The third deployment was in North Logan, UT (April–November 2016), where four BeePi monitors were placed into four beehives at two small apiaries and captured ≈20 GB of data. The fourth deployment was in Logan and North Logan, UT (April–September 2017), when four BeePi units were placed into four beehives at two small apiaries to collect ≈220 GB of audio, video, and temperature data. The fifth deployment started in April 2018, when four BeePi monitors were placed into four beehives at an apiary in Logan, UT. In September 2018, we decided to keep the monitors deployed through the winter to stress test the equipment in the harsh weather conditions of northern Utah. By May 2019, we had collected over 400 GB of video, audio, and temperature data. The sixth field deployment started in May 2019 with four freshly installed bee packages and is still ongoing as of January 2021, with ≈250 GB of data collected so far. In early June 2020, we deployed a BeePi monitor on a swarm that made its home in one of our empty hives and have been collecting data on it since then.
We should note that, unlike many apiarists, we do not intervene in the life cycle of the monitored hives in order to preserve the objectivity of our data and observations. For example, we do not apply any chemical treatments to or re-queen failing or struggling colonies.
3.2. Terminology, Notation, and Definitions
We use the terms frame and image interchangeably to refer to two-dimensional (2D) pixel matrices where pixels can be either real non-negative numbers or, as is the case with multi-channel images (e.g., PNG or BMP), tuples of real non-negative numbers.
We use pairs of matching left and right parentheses to denote sequences of symbols. We use the set-theoretic membership symbol ∈ to denote when a symbol is either in a sequence or in a set of symbols. We use the universal quantifier ∀ to denote the fact that some mathematical statement holds for all mathematical objects in a specific set or sequence and use the existential quantifier ∃ to denote the fact that some mathematical statement holds for at least one object in a specific set or sequence.
We use the symbols ∧ and ∨ to refer to the logical and and the logical or, respectively. Thus, $\forall i \in \mathbb{N} \; \exists j \in \mathbb{N} \; (j > i)$ states a common truism that for every natural number $i$ there is another natural number $j$ greater than $i$.
Let $X = (x_1, x_2, \ldots, x_m)$ and $Y = (y_1, y_2, \ldots, y_n)$ be sequences of symbols, where $m$ and $n$ are positive integers. We define the intersection of two symbolic sequences $X \cap Y$ in Equation (1) as the sequence $(z_1, z_2, \ldots, z_k)$, where $k \le \min(m, n)$ is a positive integer and $z_i \in X \wedge z_i \in Y$ whenever $z_i \in X \cap Y$, for $1 \le i \le k$. If two sequences have no symbols in common, then $X \cap Y = ()$, the empty sequence.
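The sequence intersection can be illustrated with a short sketch. The function name `seq_intersection` is ours, and we assume, as one plausible reading of Equation (1), that the result preserves the order of the symbols in the first sequence:

```python
def seq_intersection(xs, ys):
    """Return the subsequence of xs whose symbols also occur in ys.

    Returns the empty tuple when the sequences share no symbols,
    mirroring the empty intersection case in the text.
    """
    common = set(ys)
    return tuple(x for x in xs if x in common)
```

For example, `seq_intersection(('a', 'b', 'c'), ('b', 'c', 'd'))` yields `('b', 'c')`.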
We define a video $V$ to be a sequence of consecutive equi-dimensional 2D frames $(f_1, f_2, \ldots, f_t)$, where $t$, $j$, and $k$ are positive integers such that $1 \le j, k \le t$. Thus, if a video $V$ contains 745 frames, then $V = (f_1, f_2, \ldots, f_{745})$, $t = 745$. By definition, videos consist of unique frame symbols so that if $f_j \in V$ and $f_k \in V$, then $j \ne k$ implies $f_j \ne f_k$. It should be noted that $f_j$ and $f_k$ may be the same pixelwise. Any smaller sequence of consecutive frames of a larger video is also a video. For example, if $V = (f_1, f_2, \ldots, f_{10})$ is a video, then so are $(f_1, f_2, f_3)$ and $(f_5, f_6, \ldots, f_{10})$.
When we discuss multiple videos that contain the same frame symbol or when we want to emphasize specific videos under discussion, we use superscripts in frame symbols to reference respective videos. Thus, if videos $V_1$ and $V_2$ include $f_j$, then $f_j^{V_1}$ designates $f_j$ in $V_1$ and $f_j^{V_2}$ designates $f_j$ in $V_2$.
Let $V = (f_1, f_2, \ldots, f_t)$ be a video, $f_l \in V$, and $n$ be a positive integer. A frame's context in $V$, denoted as $C_n^V(f_l)$, is defined in Equation (2). In other words, the context of $f_l$ is a video that consists of a sequence of $n$ consecutive or fewer frames (possibly empty) that precede $f_l$ and a sequence of $n$ or fewer frames (possibly empty) that follow it. We refer to $n$ as a context size and to $C_n^V(f_l)$ as the $n$-context or, simply, context of $f_l$ and refer to $f_l$ as the contextualized frame of $C_n^V(f_l)$. If there is no need to reference $V$, we omit the superscript and refer to $f_l$ as the contextualized frame of $C_n(f_l)$.
For example, let $V = (f_1, f_2, \ldots, f_{12})$; then the 3-context of $f_4$ is $C_3(f_4) = (f_1, f_2, f_3, f_4, f_5, f_6, f_7)$. Analogously, the 2-context of $f_{11}$ is $C_2(f_{11}) = (f_9, f_{10}, f_{11}, f_{12})$.
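In code, the $n$-context is a clamped slice of the frame sequence. A minimal sketch (the function name is ours; frames are indexed from 1 in the text but from 0 in the list):

```python
def context(frames, l, n):
    """Return the n-context of frame l (1-based) in a video:
    up to n frames before and up to n frames after frame l,
    clamped at the video boundaries (so the context may be
    shorter near the first and last frames)."""
    lo = max(0, (l - 1) - n)
    hi = min(len(frames), l + n)  # slice end is exclusive
    return frames[lo:hi]
```

For a 12-frame video, `context(frames, 4, 3)` returns frames 1 through 7, and `context(frames, 11, 2)` returns frames 9 through 12.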
If $f \in V$, then $f(r, c)$ is the pixel value at row $r$ and column $c$ in $f$. If $f$ is a frame whose pixel values at each position $(r, c)$ are real numbers, we use the notation $\max(f)$ to refer to the maximum such value in the frame. If $V = (f_1, f_2, \ldots, f_t)$, then the mean frame of $V$, denoted as $\mu^V$ or $\mu$ when $V$ can be omitted, is the frame where the pixel value at $(r, c)$ is the mean of the pixel values at $(r, c)$ of all frames $f_i \in V$, as defined in Equation (3).
If pixels are $n$-dimensional tuples (e.g., each frame is an RGB or PNG image), then each pixel in $\mu$ is an $n$-dimensional tuple of the means of the corresponding tuple values in all frames of $V$.
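The mean frame of Equation (3) is a pixelwise average. A NumPy sketch (the function name is ours; it assumes the frames are equi-dimensional arrays and works for both single-channel and multi-channel frames, since the averaging is over the frame axis only):

```python
import numpy as np

def mean_frame(frames):
    """Pixelwise mean of a list of equi-dimensional frames.

    Each frame is an H x W (or H x W x n) array; the result has the
    same shape, with each pixel (or tuple component) averaged across
    all frames."""
    return np.mean(np.stack(frames), axis=0)
```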
Let $V = (f_1, f_2, \ldots, f_t)$ be a video, and let $n$ be a context size and $C_n(f_l)$ be the context of $f_l$. The $l$-th dynamic background frame of $V$, denoted as $B_l$, $1 \le l \le t$, is defined in Equation (4) as the mean frame of $C_n(f_l)$ (i.e., the mean frame of the $n$-context of $f_l$).
In general, the dynamic background operation specified in Equation (4) is designed to filter out noise, blurriness, and static portions of the images in a given video. As the third video set in the supplementary material shows, BeePIV can process videos taken against the background of grass, trees, and bushes. For an example, consider the 12-frame video $V$ in Figure 2, and let $n$ be the context size. Figure 3 shows 12 dynamic background frames for the video in Figure 2. In particular, $B_1$ is the mean frame of $C_n(f_1)$ of $V$; $B_2$ is the mean frame of $C_n(f_2)$ of $V$; $B_3$ is the mean frame of $C_n(f_3)$ of $V$; and $B_4$ is the mean frame of $C_n(f_4)$ of $V$. Proceeding to the right in this manner, we reach $f_{12}$, the last contextualized frame of $V$, whose dynamic background frame $B_{12}$ is the mean frame of $C_n(f_{12})$ of $V$.
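Combining the context and mean-frame definitions above, the dynamic background frames can be computed frame by frame. A sketch (function name ours; frames are assumed to be equi-dimensional NumPy arrays):

```python
import numpy as np

def dynamic_backgrounds(frames, n):
    """Return the dynamic background frames B_1..B_t of a video,
    where B_l is the pixelwise mean of the n-context of frame l
    (the context is clamped at the video boundaries)."""
    t = len(frames)
    stack = np.stack(frames).astype(float)
    backgrounds = []
    for l in range(1, t + 1):
        lo = max(0, (l - 1) - n)
        hi = min(t, l + n)  # slice end is exclusive
        backgrounds.append(stack[lo:hi].mean(axis=0))
    return backgrounds
```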
A neighborhood function maps a pixel position $(r, c)$ in $f$ to a set of positions around it. In particular, we define two neighborhood functions $N_4$ (see Equation (5)) and $N_8$ (see Equation (6)) for the standard 4- and 8-neighborhoods, respectively, used in many image processing operations. Given a position $(r, c)$ in $f$, the statement $(r', c') \in N_8((r, c))$ states that the 8-neighborhood of $(r, c)$ includes a position $(r', c')$ such that $|r - r'| \le 1$ and $|c - c'| \le 1$.
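The two neighborhoods can be sketched directly from their definitions (function names ours; clipping of out-of-frame positions at the image border is omitted for brevity):

```python
def n4(r, c):
    """4-neighborhood: the positions directly above, below,
    left of, and right of (r, c)."""
    return {(r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)}

def n8(r, c):
    """8-neighborhood: all positions (r', c') with |r - r'| <= 1 and
    |c - c'| <= 1, excluding (r, c) itself."""
    return {(r + dr, c + dc)
            for dr in (-1, 0, 1) for dc in (-1, 0, 1)
            if (dr, dc) != (0, 0)}
```

The 4-neighborhood is always a subset of the 8-neighborhood; the latter adds the four diagonal positions.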
We use the terms PIV and digital PIV (DPIV) interchangeably, as we do the terms bee and honeybee. Every occurrence of the term bee in the text of the article refers to the Apis mellifera honeybee and to no other bee species. We also interchange the terms hive and beehive to refer to a Langstroth beehive hosting an Apis mellifera colony.
3.3. BeePIV
3.3.1. Dynamic Background Subtraction
Let $V = (f_1, f_2, \ldots, f_t)$ be a video. In BeePIV, the background frames are subtracted pixelwise from the corresponding contextualized frames to obtain difference frames. The $l$-th difference frame of $V$, denoted as $D_l$, is defined in Equation (7).
The pixels in $f_l$ that are closest to the corresponding pixels in $B_l$ represent positions that have remained unchanged over the period of physical time over which the frames in $C_n(f_l)$ were captured by the camera. Consequently, the positions of the pixels in $D_l$ where the difference is relatively high signal potential bee motions.
Figure 4 shows several difference frames computed from the corresponding contextualized and background frames of the 12-frame video in Figure 2.
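A difference frame can be sketched as a pixelwise difference between a contextualized frame and its dynamic background. We show it here as an absolute difference, which is one plausible reading; Equation (7) gives the exact form used by BeePIV:

```python
import numpy as np

def difference_frame(frame, background):
    """Pixelwise difference between a contextualized frame and its
    dynamic background frame; large values flag positions where the
    frame departs from the background, i.e., potential bee motion.
    (Shown as an absolute difference; see Equation (7) for the exact
    form.)"""
    return np.abs(frame.astype(float) - background.astype(float))
```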
Let $V = (f_1, f_2, \ldots, f_t)$ be a video. We now introduce the video difference operator in Equation (8) that applies the operation in Equation (7) to every contextualized frame $f_l$, $1 \le l \le t$. For example, if $V$ is the 12-frame video in Figure 2, then the output of the video difference operator contains each of the four difference frames (i.e., $D_1$, $D_2$, $D_3$, and $D_4$) in Figure 4.
3.3.2. Difference Smoothing
A difference frame may contain not only bee motions detected in the corresponding contextualized frame but also bee motions from the other frames in the context or motions caused by flying bees' shadows or occasional blurriness. In BeePIV, smoothing is applied to $D_l$ to replace each pixel with a local average of the neighboring pixels. Insomuch as the neighboring pixels measure the same hidden variable, averaging reduces the impact of bee shadows and blurriness without necessarily biasing the measurement, which results in more accurate frame intensity values. An important objective of difference smoothing is to concentrate intensity energy in those areas of $D_l$ that represent actual bee motions in $f_l$.
The smoothing operator, $H$, is defined in Equation (9), where $\gamma$ is a real positive number and $w$ is a weighting function assigning relative importance to each neighborhood position. We will use the notation $\hat{D}_l$ to denote the smoothed difference frame obtained from $D_l$ so that $\hat{D}_l = H(D_l)$, and use $H(D_l)$ as a shorthand for the application of $H$ to every position $(r, c)$ in $D_l$ to obtain $\hat{D}_l$. In the current implementation of BeePIV, the weights are assigned using the weight function $w$ in Equation (10). Figure 5 shows several smoothed difference frames obtained from the corresponding difference frames with this weight function.
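One plausible reading of the smoothing operator is a normalized weighted average over each pixel's 3 × 3 neighborhood. The sketch below uses a caller-supplied kernel of relative weights (the actual weight function is the one in Equation (10); zero padding at the borders is our simplification):

```python
import numpy as np

def smooth(frame, kernel):
    """Replace each pixel with the weighted average of its 3x3
    neighborhood (zero padding at the borders). `kernel` holds
    relative weights and is normalized so the weights sum to 1."""
    k = np.asarray(kernel, dtype=float)
    k = k / k.sum()
    h, w = frame.shape
    padded = np.pad(frame.astype(float), 1)
    out = np.zeros((h, w))
    for dr in (-1, 0, 1):
        for dc in (-1, 0, 1):
            # shifted view of the frame weighted by the kernel entry
            out += k[dr + 1, dc + 1] * padded[1 + dr:1 + dr + h,
                                              1 + dc:1 + dc + w]
    return out
```

Because neighboring pixels are averaged, isolated spikes are flattened while coherent blobs of high difference values keep most of their intensity, which matches the stated objective of concentrating intensity energy on actual bee motions.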
Let $V = (f_1, f_2, \ldots, f_t)$ and let $(D_1, D_2, \ldots, D_t)$ be the sequence of its difference frames. The video smoothing operator applies the smoothing operator $H$ to every frame in this sequence and returns a sequence of smoothed difference frames $(\hat{D}_1, \hat{D}_2, \ldots, \hat{D}_t)$, as defined in Equation (11).
3.3.3. Color Variation
Since a single flying bee may generate multiple motion points in close proximity as its body parts (e.g., head, thorax, and wings) move through the air, pixels in close proximity (i.e., within a certain distance) whose values are above a certain threshold in smoothed difference frames can be combined into clusters. Such clusters can be reduced to single motion points on a uniform (white or black) background to improve the accuracy of PIV. Our conjecture, based on empirical observations, is that a video’s color variation levels and its bee traffic levels are related and that video-specific thresholds and distances for reducing smoothed difference frames to motion points can be obtained from video color variation.
Let $V = (f_1, f_2, \ldots, f_t)$ be a video and let $B$ (see Equation (4)) be the background frame of $V$ computed with the context size set to $t$ (i.e., the number of frames in $V$). To put it differently, as Equation (4) implies, $B$ is the mean frame of the entire video and contains information about regions with little or no variation across all frames in $V$.
A color variation frame, denoted as $CV_l$ (see Equation (12)), is computed for each $f_l \in V$ as the squared smoothed pixelwise difference between $f_l$ and $B$ across all image channels.
The color variation values from all individual color variation frames $CV_l$ are combined into one maximum color variation frame, denoted as $MCV$ (see Equation (13)), for the entire video $V$. Each position $(r, c)$ in $MCV$ holds the maximum value for $(r, c)$ across all color variation frames $CV_l$, where $1 \le l \le t$. In other words, $MCV$ contains quantized information for each pixel position on whether there has been any change in that position across all frames in the video.
Figure 6 gives the background and maximum color variation frames for a low bee traffic video $V_1$ and a high bee traffic video $V_2$. The maximum color variation frames are grayscale images whose pixel values range from 0 (black) to 255 (white). Thus, the whiter the value of a pixel in an $MCV$ frame is, the greater the color variation at the pixel's position. As can be seen in Figure 6, the $MCV$ frame of $V_1$ has fewer whiter pixels compared to the $MCV$ frame of $V_2$, and the higher color variation clusters in the latter tend to be larger and more evenly distributed across the frame than the higher color variation clusters in the former.
The color variation of a video $V$, denoted as $CV(V)$, is defined in Equation (14) as the standard deviation of the mean of $MCV$. Higher values of $CV(V)$ indicate multiple motions; lower values indicate either relatively few motions or a complete lack thereof. We intend to investigate this implication in our future work.
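The color variation pipeline of Equations (12)–(14) can be sketched end to end. The sketch below is our simplified reading: the smoothing of the squared differences is omitted, multi-channel frames are collapsed by summing over channels, and the final scalar is taken as the standard deviation of the $MCV$ values:

```python
import numpy as np

def color_variation(frames):
    """Scalar color variation of a video: per-frame squared pixelwise
    deviation from the whole-video mean frame (the CV frames),
    combined into a maximum color variation frame (MCV), then
    summarized by the standard deviation of the MCV values.
    Sketch of Equations (12)-(14); smoothing omitted for brevity."""
    stack = np.stack(frames).astype(float)
    background = stack.mean(axis=0)        # mean frame of the whole video
    cv_frames = (stack - background) ** 2  # one CV frame per video frame
    if cv_frames.ndim == 4:                # multi-channel: combine channels
        cv_frames = cv_frames.sum(axis=-1)
    mcv = cv_frames.max(axis=0)            # maximum color variation frame
    return float(np.std(mcv))
```

A completely static video yields a color variation of 0, while any motion against the background yields a positive value.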
We postulate in Equation (15) the existence of a function $\phi$ from reals to 2-tuples of reals that maps color variation values for videos (i.e., $CV(V)$) to video-specific threshold and distance values $(\tau_V, \delta_V)$ that can be used to reduce smoothed difference frames to uniform background frames with motion points. In Section 4.2, we define one such $\phi$ function in Equation (33) and evaluate it in Section 4.3, where we present our experiments on computing the PIV interrogation window size and overlap from color variation.
Suppose there is a representative sample of bee traffic videos obtained from a deployed BeePi monitor. Let $cv_{min}$ and $cv_{max}$ be experimentally observed lower and upper bounds, respectively, for the values of $CV(V)$. In other words, for any video $V$ in the sample, $cv_{min} \le CV(V) \le cv_{max}$. Let $\tau_{min}$ and $\tau_{max}$ be the experimentally selected lower and upper bounds, respectively, for the threshold $\tau_V$ that hold for all videos in the sample. Then $\tau_V$ in Equation (15) can be constrained to lie between $\tau_{min}$ and $\tau_{max}$, as shown in Equation (16).
The frame thresholding operator, $T$, is defined in Equation (17), where, for any position $(r, c)$ in $\hat{D}_l$, the pixel value is retained if it is at least $\tau_V$ and set to 0, otherwise. The video thresholding operator applies $T$ to every frame in a sequence of smoothed difference frames and returns a sequence of smoothed thresholded difference frames, as defined in Equation (18).
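The two thresholding operators are straightforward to sketch (function names ours; we assume values at least equal to the threshold are retained, per the frame-level definition above):

```python
import numpy as np

def threshold_frame(frame, tau):
    """Frame thresholding: keep pixel values >= tau, zero the rest
    (sketch of the operator in Equation (17))."""
    return np.where(frame >= tau, frame.astype(float), 0.0)

def threshold_video(frames, tau):
    """Apply the frame thresholding operator to every frame in a
    sequence (the video-level operator of Equation (18))."""
    return [threshold_frame(f, tau) for f in frames]
```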
In Section 4.2, we give the actual values of $\tau_{min}$ and $\tau_{max}$ we found by experimenting with the videos in our testbed dataset.
Figure 7 shows the background frame of the video in Figure 2 and the value of $\tau_V$ computed from $CV(V)$ for the video. Figure 8 shows the impact of smoothing difference frames with the smoothing operator $H$ (see Equation (9)) and then thresholding them with $\tau_V$ computed from $CV(V)$.
The values of $\tau_V$ are used to threshold smoothed difference frames from a video $V$ and the values of $\delta_V$, as we explain below, are used to determine which pixels are in close proximity to local maxima and should be eroded. Higher values of $CV(V)$ indicate the presence of higher bee traffic and, consequently, must be accompanied by smaller values of $\delta_V$ between maxima points, because in higher traffic videos multiple bees fly in close proximity to each other. On the other hand, lower values of $CV(V)$ indicate lower traffic and must be accompanied by higher values of $\delta_V$, because in lower traffic videos bees typically fly farther apart.
3.3.4. Difference Maxima
Equation (19) defines a maxima operator $M_4$ that returns 1 if a given position in a smoothed thresholded difference frame is a local maximum by using the neighborhood function $N_4$ in Equation (5). By analogy, we can define $M_8$, another maxima operator that does the same by using the neighborhood function $N_8$ in Equation (6).
We will use the notation $M_n$, where $n$ is a positive integer (e.g., $n = 4$ or $n = 8$), as a shorthand for the application of the maxima operator to every position of a smoothed thresholded difference frame $\hat{D}_l$ to obtain the binary frame $\hat{D}_{l,b}$. The symbol $b$ in the subscript of $\hat{D}_{l,b}$ indicates that this frame is binary, where, per Equation (19), 1's indicate positions of local maxima.
Figure 9 shows the application of $M_4$ to a smoothed difference frame to obtain the corresponding binary frame.
The video maxima operator applies the maxima operator to every frame in a sequence of smoothed thresholded difference frames and returns a sequence of binary difference frames, as defined in Equation (20).
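The 4-neighborhood maxima operator can be sketched as a scan over the frame (function name ours; we assume a strict comparison against in-frame neighbors and that only positive values qualify, which is one plausible reading of Equation (19)):

```python
import numpy as np

def local_maxima(frame):
    """Binary frame marking positions whose value is positive and
    strictly greater than all 4-neighbors; out-of-frame neighbors
    are ignored. Sketch of the maxima operator in Equation (19)."""
    h, w = frame.shape
    out = np.zeros((h, w), dtype=int)
    for r in range(h):
        for c in range(w):
            v = frame[r, c]
            if v <= 0:
                continue
            neighbors = [(r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)]
            if all(not (0 <= rr < h and 0 <= cc < w) or frame[rr, cc] < v
                   for rr, cc in neighbors):
                out[r, c] = 1
    return out
```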
3.3.5. Difference Maxima Erosion
Let $P$ be the sequence of positions of all maxima points in the frame $\hat{D}_{l,b}$. For example, in Figure 9, $P = (p_1, p_2, p_3, p_4)$. We define an erosion operator that, given a distance $\delta_V$, constructs the sequence $P$, sorts the positions in $P$ by their $i$ coordinates, and, for every position $p \in P$ that has not itself been eroded, sets the pixel values of all the positions that are within $\delta_V$ pixels of $p$ in $\hat{D}_{l,b}$ to 0 (i.e., erodes them). We let $\hat{D}_{l,e}$ refer to the smoothed and eroded binary difference frame obtained from $\hat{D}_{l,b}$ after erosion and define the erosion operator $E$ in Equation (21).
The application of the erosion operator is best described algorithmically. Consider the binary difference frame in Figure 10a. Recall that this frame is the frame in Figure 9b obtained by applying the maxima operator to the frame in Figure 9a. After the maxima operator is applied, the sequence of the local maxima positions is $P = (p_1, p_2, p_3, p_4)$.
Let us set the distance parameter $\delta$ of the erosion operator to 4 and compute $E(\hat{D}_{l,b})$. As the erosion operator scans left to right, the positions of the eroded maxima are saved in a dynamic lookup array (let us call it $I$) so that the previously eroded positions are never processed more than once. The array $I$ holds the index positions of the sorted pixel values at the positions in $P$. Initially, in our example, $I = (4, 1, 3, 2)$, because the pixel values at the positions in $P$, sorted from lowest to highest, are $125.50, 134.00, 136.00, 143.00$, so that the value 125.50 is at position 4 in $P$, the value 134.00 at position 1, the value 136.00 at position 3, and the value 143.00 at position 2. In other words, the pixel value at $p_4$ is the lowest and the pixel value at $p_2$ the highest. For each value in $I$ which has not yet been processed, the erosion operator computes the Euclidean distance between its coordinates and the coordinates of each point to the right of it in $I$ that has not yet been processed. For example, in the beginning, when the index position is at 1, the operator computes the distances between the coordinates of position 1 and the coordinates of positions 2, 3, and 4.
If the current point in $I$ is at $(i_1, j_1)$ and a point to the right of it in $I$ is at $(i_2, j_2)$, then $d = \sqrt{(i_1 - i_2)^2 + (j_1 - j_2)^2}$ is the distance between them. If $d \le \delta$, the point at $(i_2, j_2)$ is eroded and is marked as such. The erosion operator continues to loop through $I$, skipping the indices of the points that have been eroded. In this example, the positions 2 and 3 in $P$ are eroded. Thus, the frame $\hat{D}_{l,e}$ shown in Figure 10b has 1's at positions $p_1$ and $p_4$ and 0's everywhere else.
Since the erosion operator is greedy, it does not necessarily ensure that the largest pixel values are always selected, because their corresponding positions may be within the distance $\delta$ of a given point whose value may be lower. In our example, the largest pixel value 143 at position $p_2$ is eroded, because it is within the distance threshold of $p_4$, which is considered first.
A more computationally involved approach to guarantee the preservation of relative local maxima is to sort the values in $P$ in descending order and continue to erode the positions within $\delta$ pixels of the position of each sorted maximum until there is nothing else to erode. In practice, we found that this method does not contribute to the accuracy of the algorithm due to the proximity of local maxima to each other. To put it differently, it is the positions of the local maxima that matter, not their actual pixel values in smoothed difference frames, in that the motion points generated by the multiple body parts of a flying bee have a strong tendency to cluster in close proximity.
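The greedy erosion can be sketched as follows (function name ours). We assume, following the worked example, that maxima are visited in ascending order of their pixel values and that each surviving maximum erodes every not-yet-visited maximum within the distance threshold:

```python
import numpy as np

def erode_maxima(binary, values, delta):
    """Greedy erosion of local maxima (sketch of Equation (21)).

    `binary` marks maxima positions with 1; `values` holds the smoothed
    difference values used to order them. Maxima are visited from the
    lowest value to the highest; each surviving maximum erodes (zeroes)
    every not-yet-visited maximum within Euclidean distance delta."""
    positions = [tuple(p) for p in np.argwhere(binary == 1)]
    order = sorted(positions, key=lambda p: values[p])  # ascending value
    eroded = set()
    out = binary.copy()
    for i, p in enumerate(order):
        if p in eroded:
            continue  # already eroded; never processed again
        for q in order[i + 1:]:
            if q not in eroded and np.hypot(p[0] - q[0], p[1] - q[1]) <= delta:
                eroded.add(q)
                out[q] = 0
    return out
```

Replaying the worked example (four maxima with values 134.00, 143.00, 136.00, and 125.50, distance 4) leaves exactly the two maxima corresponding to $p_1$ and $p_4$.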
The video erosion operator applies the erosion operator $E$ to every frame in a sequence of binary difference frames and returns a sequence of corresponding eroded frames, as defined in Equation (22).
The positions of 1's in each eroded frame $\hat{D}_{l,e}$ are treated as centers of small circles whose radius is a fraction $\rho$ of the width of $\hat{D}_{l,e}$, where $\rho$ is a fixed constant in our current implementation. The $\rho$ parameter, in effect, controls the size of the motion points for PIV. After the black circles are drawn on a white background, the frame $\hat{D}_{l,e}$ becomes the motion frame $W_l$. We refer to the $W_l$ frames as motion frames and define the drawing operator in Equation (23), where $W_l$ is obtained by drawing at each position with 1 in $\hat{D}_{l,e}$ a black circle with a radius of $\rho$ of the width of $\hat{D}_{l,e}$.
Figure 11 shows twelve motion frames obtained from the twelve frames in Figure 2. Figure 12 shows the detected motion points in each motion frame $W_l$ plotted on the corresponding original frame $f_l$ from which $W_l$ was obtained. The motion frames record bee motions reduced to single points and constitute the input to the PIV algorithm described in the next section.
The video drawing operator applies the drawing operator to every frame in a sequence of eroded frames and returns the sequence of corresponding white background frames with black circles $(W_1, W_2, \ldots, W_t)$, as defined in Equation (24). This sequence of motion frames is given to the PIV algorithm described in the next section.
3.3.6. PIV and Directional Bee Traffic
Let $Z = (W_1, W_2, \ldots, W_t)$ be a sequence of motion frames and let $W_l$ and $W_{l+1}$ be two consecutive motion frames in this sequence that correspond to original video frames $f_l$ and $f_{l+1}$. Let $A_1$ be a square window, referred to as interrogation area or interrogation window in the PIV literature, selected from $W_l$ and centered at position $(r, c)$. Another window, $A_2$, is selected in $W_{l+1}$. The position of $A_2$ in $W_{l+1}$ is a function of the position of $A_1$ in $W_l$ in that it changes relative to $A_1$ to find the maximum correlation peak. For each possible position of $A_2$ in $W_{l+1}$, a corresponding correlation value is computed.
The 2D matrix correlation between $A_1$ and $A_2$ is computed with the formula in Equation (25), where $m$ and $n$ are integers ranging over the admissible displacements. In Equation (25), $A_1(i, j)$ and $A_2(i + m, j + n)$ are the pixel intensities at locations $(i, j)$ in $A_1$ and $(i + m, j + n)$ in $A_2$. For each possible position $(m, n)$ of $A_2$ inside $W_{l+1}$, the correlation value $C(m, n)$ is computed. If the size of $A_1$ is $K \times K$ and the size of $A_2$ is $L \times L$, then the size of the matrix $C$ is $(K + L - 1) \times (K + L - 1)$.
The matrix $C$ records the correlation coefficient for each possible alignment of $A_1$ with $A_2$. A faster way to calculate correlation coefficients between two image frames is to use the Fast Fourier Transform (FFT) and its inverse, as shown in Equation (26); the computation of Equation (26) is faster than the computation of Equation (25), but it requires that $A_1$ and $A_2$ be of the same size.
If $C(m^*, n^*)$ is the maximum value in $C$ and $(r, c)$ is the center of $A_1$, then $(r, c)$ in $W_l$ together with the correspondingly displaced position in $W_{l+1}$ defines a displacement vector from $W_l$ to $W_{l+1}$. This vector represents how particles may have moved from $W_l$ to $W_{l+1}$. The displacement vectors form a vector field used to estimate possible flow patterns.
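The FFT route of Equation (26) can be sketched for two same-size interrogation windows: zero-pad to the full correlation size, multiply the conjugated spectrum of the first window by the spectrum of the second, invert, and read the displacement off the correlation peak. The function name and the wrap-around peak decoding are ours:

```python
import numpy as np

def piv_displacement(a1, a2):
    """Estimate the (row, column) displacement of the pattern in
    window a1 to its position in window a2 via FFT cross-correlation.
    Sketch of Equation (26) for same-size square windows."""
    h, w = a1.shape
    full = (2 * h - 1, 2 * w - 1)  # full correlation size
    fa = np.fft.rfft2(a1, s=full)
    fb = np.fft.rfft2(a2, s=full)
    corr = np.fft.irfft2(np.conj(fa) * fb, s=full)
    peak = np.unravel_index(np.argmax(corr), corr.shape)
    # peak indices beyond h-1 (w-1) wrap around to negative shifts
    dr = peak[0] if peak[0] < h else peak[0] - full[0]
    dc = peak[1] if peak[1] < w else peak[1] - full[1]
    return dr, dc
```

For example, a single motion point that moves 2 rows down and 1 column right between windows yields the displacement (2, 1).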
Figure 13 shows two frames $f_l$ and $f_{l+1}$, two corresponding motion frames $W_l$ and $W_{l+1}$ obtained from them by the application of the operator in Equation (24), and the vector field computed from $W_l$ and $W_{l+1}$ by Equation (26). The two displacement vectors correspond to the motions of two bees, with the left bee moving slightly left and the right bee moving down and right.
In Equation (27), we define the PIV operator $G$ that applies to two consecutive motion frames $W_l$ and $W_{l+1}$ to generate a field of displacement vectors. The video PIV operator applies the PIV operator $G$ to every pair of consecutive motion frames $W_l$ and $W_{l+1}$ in a sequence of motion frames $Z$ and returns a sequence of vector fields, as defined in Equation (28). It should be noted that the number of motion frames in $Z$ exceeds the number of the vector fields returned by the video PIV operator by exactly 1.
After the vector fields are computed by the $G$ operator for each pair of consecutive motion frames $W_l$ and $W_{l+1}$, the directions of the displacement vectors are used to estimate directional bee traffic. Each vector is classified as lateral, incoming, or outgoing according to the value ranges in Figure 14: a vector is classified as outgoing if its direction is in the outgoing range, as incoming if its direction is in the incoming range, and as lateral otherwise.
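The classification can be sketched as a range test on the vector's angle. The specific angular ranges below are placeholders of our own choosing, not the ones in Figure 14, which depend on the camera's orientation over the landing pad:

```python
import math

# Placeholder angular ranges in degrees; the actual ranges are those
# shown in Figure 14 and depend on the camera's orientation.
OUTGOING_RANGE = (45.0, 135.0)
INCOMING_RANGE = (225.0, 315.0)

def classify_vector(dx, dy):
    """Classify a displacement vector as 'outgoing', 'incoming', or
    'lateral' by the angular range its direction falls into."""
    angle = math.degrees(math.atan2(dy, dx)) % 360.0
    if OUTGOING_RANGE[0] <= angle < OUTGOING_RANGE[1]:
        return 'outgoing'
    if INCOMING_RANGE[0] <= angle < INCOMING_RANGE[1]:
        return 'incoming'
    return 'lateral'
```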
Let $W_l$ and $W_{l+1}$ be two consecutive motion frames from a video $V$, and let $I_l$, $O_l$, and $L_l$ be the counts of incoming, outgoing, and lateral vectors computed from them. If $Z$ is a sequence of $k$ motion frames obtained from $V$, then these counts can be used to define three video-based functions $IN(Z)$, $OUT(Z)$, and $LAT(Z)$ that return the counts of incoming, outgoing, and lateral displacement vectors for $Z$, as shown in Equation (29).
For example, if $Z = (W_1, W_2, W_3)$ and the pair $(W_1, W_2)$ yields the counts $I_1$, $O_1$, and $L_1$ while the pair $(W_2, W_3)$ yields the counts $I_2$, $O_2$, and $L_2$, then $IN(Z) = I_1 + I_2$, $OUT(Z) = O_1 + O_2$, and $LAT(Z) = L_1 + L_2$.
We define the video motion count operator in Equation (30) as the operator that returns the 3-tuple of directional motion counts $(IN(Z), OUT(Z), LAT(Z))$ obtained from a sequence of motion frames $Z$.
3.3.7. Putting It All Together
We can now define the BeePIV algorithm in a single equation. Let $V$ be a video. Let the video's color variation (i.e., $CV(V)$) be computed with Equation (14) and let the values of $\tau_V$ and $\delta_V$ be computed from $CV(V)$ with Equation (15). We also need to select a context size $n$ and the parameter $\rho$ for the circle drawing operator in Equation (24) to generate motion frames.
The BeePIV algorithm is defined in Equation (31) as the operator that applies to a video $V$. The operator is a composition of operators where each subsequent operator is applied to the output of the previous one. It starts by applying the video difference operator in Equation (8) to subtract from each contextualized frame $f_l$ its background frame $B_l$. The output of the video difference operator is given to the video smoothing operator in Equation (11). The smoothed frames are given to the video maxima operator in Equation (20) to detect the positions of local maxima. The frames produced by the video maxima operator are given to the video erosion operator in Equation (22). The eroded frames are processed by the video drawing operator in Equation (24), which turns them into white background motion frames and gives them to the video motion count operator in Equation (30) to return the counts of incoming, outgoing, and lateral displacement vectors. We refer to Equation (31) as the BeePIV equation.
Figure 15 gives a flowchart of the BeePIV algorithm.