Figure 1.
In our framework, the robot joining an ongoing social group interaction involves four steps. First, the robot perceives the scene using on-board sensors and extracts hand-engineered features. Second, the features are used to detect groups in the scene. Third, the robot recognizes the patterns of the groups by estimating the F-formations, and, finally, the robot finds the optimal spot and joins the ongoing social group interaction. The steps in the red rectangle (detecting groups and estimating F-formations) are addressed in this paper.
Figure 2.
(a–d) present the four F-formations. (d,e) represent the same circular formation, (d) presents the ego-centric view, and (e) presents the top view. F-formations define three spaces. O-space is the empty space between people involved in the interaction. P-space is the narrow strip on which people stand while conversing. R-space is the space beyond the P-space, as seen in (e).
Figure 3.
(a–d) present the constraint-based formations. (a,b) represent the triangular formation: (a) presents the top view, whereas (b) presents the ego-centric view. (c,d) present the semi-circular formation: (c) the ego-centric view and (d) the top view.
Figure 4.
Overview of our framework. The robot uses on-board sensors to process the scene and extract the number of people and their spatial and orientational information. These hand-engineered features are used to detect the groups, using the O-spaces and the transactional segments of the people. Then, the patterns of the groups are recognized by estimating the F-formations using a classifier and the variance of the group members.
Figure 5.
V6KP estimates the head orientation of people in the scene. The left part presents one image of two groups interacting in the scene: one group comprises four members and the second group comprises two members. The right part presents four images of the individual people detected in the first group, together with the cropped images of their heads and their facing directions along with the calculated orientations.
Figure 6.
(a) The person and their transactional segment. (b) The circle is the O-space, and the black shaded region indicates the overlap (mapping) of the person’s transactional segment with the O-space.
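Figure 6 suggests that group membership can be tested by measuring how much of a person's transactional segment maps onto a candidate O-space. The following is a minimal sketch of such an overlap test, assuming the transactional segment is approximated by a circular sector in front of the person and the O-space by a circle; the sector radius, aperture, and sampling resolution are illustrative assumptions rather than values from the paper.

```python
import numpy as np

def transactional_overlap(person_xy, person_yaw, o_centre, o_radius,
                          seg_radius=1.2, aperture=np.radians(90), n=2000):
    """Fraction of a person's transactional segment that falls inside an O-space.

    The segment is modelled as a circular sector of radius `seg_radius` and
    angular width `aperture`, centred on the person's facing direction.
    All geometric parameters here are illustrative assumptions.
    """
    rng = np.random.default_rng(0)
    # Sample points inside the sector (sqrt makes the sampling area-uniform).
    r = seg_radius * np.sqrt(rng.random(n))
    a = person_yaw + (rng.random(n) - 0.5) * aperture
    pts = np.column_stack([person_xy[0] + r * np.cos(a),
                           person_xy[1] + r * np.sin(a)])
    # A sampled point maps onto the O-space if it lies inside the O-space circle.
    inside = np.linalg.norm(pts - np.asarray(o_centre), axis=1) <= o_radius
    return inside.mean()

# Example: a person 1 m from the O-space centre and facing it directly.
print(transactional_overlap((0.0, 0.0), 0.0, o_centre=(1.0, 0.0), o_radius=0.6))
```

A person directly facing a nearby O-space yields a high overlap fraction, whereas a person facing away yields a fraction near zero, which is the kind of cue shown by the black shaded region in (b).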
Figure 7.
Image (b) is the virtual scene of image (a), which is an original image from the coffee dataset. The similarities in image (b) with respect to image (a) include the floor, plants, walls with windows, doors, lamps, the same number of VAs as people, and their appearance. The table with coffee and tea machines, glasses, snacks, and water bottles can also be seen in image (b). The VAs in the simulation scene are created such that their appearance resembles that of the people in the scene as closely as possible.
Figure 8.
The global view of the virtual Scene 2, which resembles a conference-break social interaction scenario similar to the experiment scene of our previous work [76]. The scene consists of a large hall with a red carpet, walls, paintings on the walls, round pub-style tables, and the VAs positioned in a number of groups of varying sizes in different formations.
Figure 9.
Images from Scene 1: (a) the virtual Pepper robot facing the scene in which VAs are interacting with each other; (b) the egocentric view of the scene from the robot’s camera.
Figure 10.
The global view of Scene 1 along with the 100 randomly generated positions for the robot, which are represented by purple-coloured cylindrical spaces. The arrows represent the facing direction of the robot at the respective positions (best viewed in colour).
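As a rough illustration of how the randomly generated robot poses in Figures 10 and 13 might be produced, the sketch below samples positions uniformly within a rectangular floor area and assigns each position a random facing direction; the floor bounds, the seed, and the uniform-yaw assumption are placeholders rather than the settings used to build the scenes.

```python
import math
import random

def sample_robot_poses(n, x_range=(0.0, 10.0), y_range=(0.0, 8.0), seed=42):
    """Sample n robot poses (x, y, yaw) uniformly within a rectangular floor.

    Bounds and seed are illustrative assumptions; yaw is in radians in [0, 2*pi).
    """
    rng = random.Random(seed)
    poses = []
    for _ in range(n):
        x = rng.uniform(*x_range)
        y = rng.uniform(*y_range)
        yaw = rng.uniform(0.0, 2.0 * math.pi)
        poses.append((x, y, yaw))
    return poses

poses = sample_robot_poses(100)   # e.g. the 100 candidate positions of Scene 1
print(poses[0])
```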
Figure 11.
The global view of Scene 1, along with the 31 positions for the robot; the robot can be seen in one of the purple cylindrical spaces (best viewed in colour).
Figure 12.
Images (a) and (b) are sample images of Scene 1 through the robot’s camera from two positions (best viewed in colour).
Figure 13.
The global view of Scene 2, along with the 500 randomly generated positions for the robot, which are represented by purple-coloured cylindrical spaces. The arrows represent the facing direction of the robot at the respective positions (best viewed in colour).
Figure 14.
The global view of Scene 2, along with the 247 positions for the robot (best viewed in colour).
Figure 15.
We organise the presentation of the results according to three main headings: head orientation, group detection, and estimating F-formation. In head orientation and group detection, we present the results first from Scene 1 and then from Scene 2. In estimating F-formation, however, we first present the results from Scene 2, as the classifier was trained on Scene 2. For robust evaluation, the trained classifier was then used on Scene 1 data. A summary that illustrates the presentation of the experimental results is shown.
Figure 16.
Our approach, V6KP, estimating the head orientation of people in Scene 1 from an egocentric view. The letter A stands for facing about, R for facing right, C for facing centre, and L for facing left.
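Figure 16 reports four discrete facing classes (A, R, C, L). A minimal sketch of how a continuous head-yaw estimate, such as one derived from V6KP keypoints, could be binned into these classes is shown below; the angular thresholds and the camera-centred yaw convention are assumptions for illustration only.

```python
def facing_class(yaw_deg):
    """Map a head yaw (degrees, 0 = facing the camera, positive = turning left)
    to one of the four facing classes of Figure 16.

    The thresholds are illustrative assumptions, not the values used in the paper.
    """
    yaw = (yaw_deg + 180.0) % 360.0 - 180.0   # normalise to [-180, 180)
    if -30.0 <= yaw <= 30.0:
        return "C"   # facing centre (towards the camera)
    if 30.0 < yaw <= 120.0:
        return "L"   # facing left
    if -120.0 <= yaw < -30.0:
        return "R"   # facing right
    return "A"       # facing about (away from the camera)

print([facing_class(a) for a in (0, 60, -60, 170)])   # ['C', 'L', 'R', 'A']
```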
Figure 17.
Image (a) presents the output image from the robot. The numbers on the VAs in the image represent their tracking IDs. Image (b) presents the terminal with the results, which show the information for the different groups, i.e., the number of groups, the VAs in each group, their spatial and orientational information, and their tracking IDs.
Figure 18.
Scene 2 consists of a number and variety of F-formations. The 13 formations are numbered and listed as follows: 1 and 8 are vis-a-vis formations; 2 and 13 are side-by-side formations; 3, 6, and 9 are L-shape formations; 4, 5, 10, and 11 are circular formations; 12 is a semi-circular formation; and 7 is a triangular formation.
Table 1.
Interpersonal distances of people.
| Spaces | Distances between People | Interactions between |
|---|---|---|
| Intimate | 0–0.5 m | Couples or partners |
| Personal | 0.5–1.2 m | Friends or family |
| Social | 1.2–3.7 m | Colleagues or unknown |
| Public | Above 3.7 m | Speaker and people (public speeches) |
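Table 1 maps interpersonal distance onto one of the four proxemic spaces. The bands can be encoded directly as a lookup, as in the sketch below, which uses only the boundaries listed in the table.

```python
def proxemic_space(distance_m):
    """Classify an interpersonal distance (metres) using the bands of Table 1."""
    if distance_m < 0.5:
        return "intimate"   # couples or partners
    if distance_m < 1.2:
        return "personal"   # friends or family
    if distance_m <= 3.7:
        return "social"     # colleagues or unknown people
    return "public"         # speaker and audience (public speeches)

print(proxemic_space(0.9))   # 'personal'
```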
Table 2.
Random positions for the robot in both scenes; the final number of positions is the number of retained positions minus those with only one VA.
| Scenes | Randomly Generated | Retained | Only One VA | Final Positions |
|---|---|---|---|---|
| Scene 1 | 100 | 31 | 3 | 28 |
| Scene 2 | 500 | 276 | 29 | 247 |
Table 3.
Head orientation: ground truth information for both scenes.
| Scenes | Facing About | Facing Right | Facing Centre | Facing Left | Total |
|---|---|---|---|---|---|
| Scene 1 | 49 | 44 | 30 | 45 | 168 |
| Scene 2 | 542 | 303 | 274 | 250 | 1369 |
Table 4.
F-formations: ground truth information for both scenes.
| Scenes | Vis-a-Vis | Side-by-Side | L-Shape | Circular | Semi-Circular | Triangular | Total Groups |
|---|---|---|---|---|---|---|---|
| Scene 1 | 11 | 0 | 17 | 20 | 0 | 0 | 48 |
| Scene 2 | 56 | 87 | 136 | 63 | 28 | 20 | 390 |
Table 5.
Estimating the head orientation in Scene 1: confusion matrix and accuracy.
| Facing Direction | About | Right | Centre | Left | Unrecognized | Total | Accuracy (%) |
|---|---|---|---|---|---|---|---|
| About | 25 | 5 | 6 | 4 | 9 | 49 | 51 |
| Right | 4 | 31 | 4 | 0 | 5 | 44 | 70 |
| Centre | 1 | 2 | 24 | 1 | 2 | 30 | 80 |
| Left | 2 | 0 | 3 | 38 | 2 | 45 | 84 |
Table 6.
Estimating the head orientation in Scene 2: confusion matrix and accuracy.
| Facing Direction | About | Right | Centre | Left | Unrecognized | Total | Accuracy (%) |
|---|---|---|---|---|---|---|---|
| About | 319 | 41 | 99 | 40 | 44 | 542 | 58 |
| Right | 23 | 229 | 32 | 1 | 18 | 303 | 75 |
| Centre | 5 | 15 | 230 | 18 | 6 | 274 | 83 |
| Left | 10 | 4 | 23 | 204 | 9 | 250 | 81 |
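The per-class accuracies in Tables 5 and 6 follow from dividing the diagonal entry of each row (correct estimates) by the row total. The sketch below reproduces the Scene 1 figures of Table 5 in this way.

```python
confusion = {                  # Scene 1 rows of Table 5: predictions per true class
    "About":  {"About": 25, "Right": 5, "Centre": 6, "Left": 4, "Unrecognized": 9},
    "Right":  {"About": 4, "Right": 31, "Centre": 4, "Left": 0, "Unrecognized": 5},
    "Centre": {"About": 1, "Right": 2, "Centre": 24, "Left": 1, "Unrecognized": 2},
    "Left":   {"About": 2, "Right": 0, "Centre": 3, "Left": 38, "Unrecognized": 2},
}

for true_class, row in confusion.items():
    total = sum(row.values())
    accuracy = 100.0 * row[true_class] / total
    print(f"{true_class}: {accuracy:.0f}% of {total}")
# About: 51% of 49, Right: 70% of 44, Centre: 80% of 30, Left: 84% of 45
```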
Table 7.
Scene 1: precision, recall, and F-measure values for groups detected by the robot.
| T | Metric | 0% | 10% | 20% | 30% | 40% | 50% | 60% | 70% | 80% | 90% | 100% |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2/3 | Precision | 0.65 | 0.94 | 0.94 | 0.89 | 0.89 | 0.85 | 0.85 | 0.82 | 0.82 | 0.82 | 0.78 |
| 2/3 | Recall | 0.64 | 0.85 | 0.85 | 0.80 | 0.80 | 0.76 | 0.76 | 0.72 | 0.70 | 0.70 | 0.67 |
| 2/3 | F-measure | 0.65 | 0.89 | 0.89 | 0.84 | 0.84 | 0.81 | 0.81 | 0.76 | 0.76 | 0.76 | 0.72 |
| 1 | Precision | 0.44 | 0.72 | 0.72 | 0.69 | 0.69 | 0.67 | 0.63 | 0.60 | 0.60 | 0.58 | 0.58 |
| 1 | Recall | 0.41 | 0.64 | 0.64 | 0.61 | 0.61 | 0.60 | 0.56 | 0.52 | 0.50 | 0.49 | 0.49 |
| 1 | F-measure | 0.42 | 0.68 | 0.68 | 0.64 | 0.64 | 0.63 | 0.60 | 0.55 | 0.55 | 0.53 | 0.53 |
Table 8.
Scene 2: precision, recall, and F-measure values for groups detected by the robot.
| T | Metric | 0% | 10% | 20% | 30% | 40% | 50% | 60% | 70% | 80% | 90% | 100% |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2/3 | Precision | 0.69 | 0.80 | 0.80 | 0.79 | 0.78 | 0.78 | 0.77 | 0.76 | 0.76 | 0.76 | 0.72 |
| 2/3 | Recall | 0.70 | 0.69 | 0.69 | 0.67 | 0.67 | 0.66 | 0.65 | 0.63 | 0.63 | 0.60 | 0.58 |
| 2/3 | F-measure | 0.70 | 0.75 | 0.74 | 0.73 | 0.72 | 0.71 | 0.71 | 0.69 | 0.69 | 0.68 | 0.64 |
| 1 | Precision | 0.33 | 0.66 | 0.66 | 0.65 | 0.65 | 0.63 | 0.63 | 0.61 | 0.61 | 0.60 | 0.57 |
| 1 | Recall | 0.34 | 0.57 | 0.56 | 0.55 | 0.55 | 0.53 | 0.52 | 0.50 | 0.50 | 0.48 | 0.45 |
| 1 | F-measure | 0.33 | 0.61 | 0.60 | 0.59 | 0.59 | 0.58 | 0.57 | 0.55 | 0.55 | 0.53 | 0.50 |
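Tables 7 and 8 report group detection quality under two tolerance values T. A common convention, which the T = 2/3 and T = 1 rows suggest, is to count a detected group as a true positive when it shares at least ⌈T·|G|⌉ members with a ground-truth group G, and to compute precision, recall, and F-measure from the resulting counts. The sketch below follows that convention and is an assumed reconstruction, not the paper's exact matching procedure.

```python
import math

def group_metrics(detected, ground_truth, T=2/3):
    """Precision/recall/F-measure for group detection under tolerance T.

    A detected group matches a ground-truth group when their intersection
    contains at least ceil(T * |ground-truth group|) members (an assumed,
    commonly used criterion).
    """
    matched_gt, tp = set(), 0
    for det in map(set, detected):
        for i, gt in enumerate(map(set, ground_truth)):
            if i in matched_gt:
                continue
            if len(det & gt) >= math.ceil(T * len(gt)):
                tp += 1
                matched_gt.add(i)
                break
    precision = tp / len(detected) if detected else 0.0
    recall = tp / len(ground_truth) if ground_truth else 0.0
    f = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f

# Toy example with person IDs: one correct group, one spurious, one missed.
print(group_metrics(detected=[[1, 2, 3], [7, 8]],
                    ground_truth=[[1, 2, 3], [4, 5]]))   # (0.5, 0.5, 0.5)
```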
Table 9.
Estimating F-formations in Scene 2: confusion matrix and accuracy.
| F-Formations | Vis-a-Vis | Side-by-Side | L-Shape | Circular | Accuracy | Total |
|---|---|---|---|---|---|---|
| Vis-a-Vis | 12 | 0 | 0 | 0 | 100% | 12 |
| Side-by-Side | 0 | 17 | 1 | 0 | 94% | 18 |
| L-Shape | 0 | 1 | 27 | 0 | 96% | 28 |
| Circular | 0 | 0 | 0 | 13 | 100% | 13 |
Table 10.
Estimating F-formations in Scene 1: confusion matrix and accuracy.
| F-Formations | Vis-a-Vis | Side-by-Side | L-Shape | Circular | Accuracy | Total |
|---|---|---|---|---|---|---|
| Vis-a-Vis | 9 | 1 | 1 | 0 | 81% | 11 |
| Side-by-Side | 0 | 0 | 0 | 0 | – | – |
| L-Shape | 6 | 0 | 11 | 0 | 64% | 17 |
| Circular | 0 | 0 | 0 | 20 | 100% | 20 |
Table 11.
Estimating constrained F-formations: confusion matrix and accuracy.
| F-Formations | Triangular | Semi-Circular | Circular | Unrecognized | Accuracy | Total |
|---|---|---|---|---|---|---|
| Triangular | 9 | 0 | 9 | 2 | 45% | 20 |
| Semi-Circular | 0 | 28 | 0 | 0 | 100% | 28 |