1.1. Sense of Touch and Vision-Based Tactile Sensing
The sense of touch is an important feedback modality that allows humans to perform many tasks. Consider, for example, the task of inserting a key into a lock. After obtaining an estimate of the keyhole’s position, we rely almost exclusively on the sense of touch to move the key from the general vicinity of the keyhole to full insertion. During these fleeting seconds, we rely on tactile feedback to adjust both the position and the orientation of the key until insertion is achieved. More subtly, we also use tactile feedback to regulate how much force is needed to keep the grasped key from slipping. For robots to perform tasks such as grasping [
1,
2], peg-in-hole insertion [
3,
4], and other tasks that require dexterity, it becomes paramount that robotic systems have a sense of touch [
5,
6,
7]. Much work has been conducted on augmenting robots with an artificial sense of touch [
8]. Several tactile sensor designs exist in the literature. These include sensors based on various transduction mechanisms (capacitive, resistive, ferromagnetic, optical, etc.) as well as those based on piezoelectric materials [
7]. However, these sensors have high instrumentation costs and are thus hard to maintain over long periods of time [
9]. A particularly promising class of tactile sensors is vision-based tactile sensors (VBTSs) [
9].
VBTS systems consist of an imaging device that captures the deformation of a soft material caused by contact with the external environment. The soft material is equipped with a pattern that allows the image sensor to capture deformations clearly. Such patterns include small colored markers [
10,
11], randomly dispersed fluorescent markers [
12], and colored LED patterns [
13]. Compared to other tactile sensors, VBTSs do not require much instrumentation; only the imaging device and a source of illumination need to be installed and maintained. This is important for the longevity and usability of the sensor. Although VBTSs have low instrumentation overhead, they provide high-resolution tactile feedback. VBTSs output images, which can thus be processed using classical and learning-based image-processing methods. Some algorithms build on intermediate features such as marker position, optical flow, or depth [
3,
10,
11,
14,
15,
16,
17,
18], while others build and train end-to-end models [
3,
18,
19,
20]. The high resolution of VBTS allows robots to perform tasks such as detecting slip [
2,
21,
22], estimating force distribution [
11,
23], classifying surface textures [
24,
25], and manipulating small and soft objects [
26,
27], as well as many other tasks that require dexterity and fine-grained pose estimation.
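For concreteness, the sketch below (ours, not taken from the cited works) illustrates one such intermediate-feature step: extracting marker centroids from a single tactile frame with OpenCV. The threshold, the minimum blob area, and the assumption of dark markers on a brighter membrane are illustrative placeholders only.

```python
# Illustrative sketch: marker-centroid extraction from one VBTS frame.
# Assumes dark markers on a brighter gel; threshold/area values are placeholders.
import cv2
import numpy as np

def marker_centroids(gray: np.ndarray, thresh: int = 60, min_area: int = 5) -> np.ndarray:
    """Return an (N, 2) array of (x, y) marker centroids."""
    # Binarize: markers are assumed darker than the background membrane.
    _, binary = cv2.threshold(gray, thresh, 255, cv2.THRESH_BINARY_INV)
    # Label connected blobs and keep the centroids of sufficiently large ones.
    _, _, stats, centroids = cv2.connectedComponentsWithStats(binary, connectivity=8)
    keep = stats[1:, cv2.CC_STAT_AREA] >= min_area  # label 0 is the background
    return centroids[1:][keep]

# Example usage on a frame captured from the tactile camera:
# frame = cv2.imread("tactile_frame.png", cv2.IMREAD_GRAYSCALE)
# pts = marker_centroids(frame)
```

Tracking such centroids across frames, or computing optical flow or depth maps, then provides the intermediate features on which the downstream task-specific processing operates.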
In our previous work [
10,
28], we introduced a multifunctional sensor for navigation and guidance, precision machining, and vision-based tactile sensing. Concretely, the sensor was introduced in the context of the aerospace manufacturing industry, where maintaining high precision while machining is imperative. The proposed sensor configuration is depicted in
Figure 1. When the hatch is opened by the servo motor, the camera used for VBTS serves to navigate to the desired position. Once the target position is reached, the hatch is closed, and the deburring or drilling process starts. The tactile feedback from the VBTS ensures that the robot maintains perpendicularity to the workpiece while machining. The video stream of the imaging device is processed by a convolutional neural network (CNN). Ensuring perpendicularity is crucial in the aerospace manufacturing industry [
29,
30,
31]. Failure to abide by perpendicularity requirements leads to an increase in bending stress and a decrease in fatigue life, thus lowering the reliability of aircraft parts [
32,
33]. Several other works in the literature use VBTS for contact angle prediction [
19,
34,
35]. Lepora et al. [
19] use a CNN on binarized images from a camera to estimate the contact pose. Psomopoulou et al. [
34] also use a CNN to estimate the contact angle while grasping an object. Finally, Ref. [
35] extracts the markers’ positions using blob detection and then constructs a 2D graph with markers as nodes via Delaunay triangulation. The graphs are then processed by a graph neural network to predict the contact angle. All of the aforementioned works use conventional VBTSs, which require internal illumination and have a limited temporal update rate.
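As an illustration of the marker-graph idea attributed to [35], the sketch below builds Delaunay edges over detected marker centroids using SciPy; the function and variable names are ours, and the actual pipeline in [35] may differ in its details.

```python
# Hedged sketch: markers become graph nodes; Delaunay triangulation supplies edges.
import numpy as np
from scipy.spatial import Delaunay

def delaunay_edges(markers: np.ndarray) -> np.ndarray:
    """markers: (N, 2) marker centroids -> (E, 2) array of unique undirected edges."""
    tri = Delaunay(markers)
    edges = set()
    for a, b, c in tri.simplices:               # each triangle contributes three edges
        for u, v in ((a, b), (b, c), (c, a)):
            edges.add((min(u, v), max(u, v)))   # store each edge once, smaller index first
    return np.array(sorted(edges))

# edge_index = delaunay_edges(pts)  # the resulting graph can be fed to a graph neural network
```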
Most VBTSs, including [
10,
19,
28,
34,
35], rely on standard frame-based cameras. These cameras capture full-resolution frames of pixel intensities at a fixed, synchronous rate. Consequently, even when there are no changes in the scene, frame-based cameras continue to capture frames, leading to redundant processing of unchanged pixels between consecutive frames. This redundancy contributes no additional information to downstream tasks. Moreover, the synchronous nature of frame-based cameras poses challenges in scenarios that require fast perception and action. The latency due to the exposure time of synchronous cameras implies a delay in the robot’s action. For instance, in drilling and deburring tasks, it is essential to swiftly perceive unexpected events, such as a deviation from perpendicularity, and react promptly to prevent damage to the machine and workpiece [
36,
37]. High-speed cameras can reduce the latency due to exposure time; however, they produce more data, which requires higher bandwidth to transfer and more processing power, ultimately incurring latency of its own. Thus, in VBTSs utilizing frame-based cameras, a trade-off between exposure-time latency and processing latency exists. Therefore, an asynchronous, low-bandwidth sensor with a fast update rate is needed. The emergence of neuromorphic event-based cameras, driven by advancements in vision sensor technologies, has addressed these limitations, and such cameras have become a critical tool for achieving accurate visual perception and navigation.
The event-based camera, a bio-inspired device, offers unprecedented capabilities compared to standard cameras, including its asynchronous nature, high temporal resolution, high dynamic range, and low power consumption. Therefore, by utilizing an event camera instead of a standard camera, we can enhance the potential of our previous sensor and improve navigation and machining performance in challenging illumination scenarios. In this work, we use an event-based camera for VBTS. This allows us to reduce computational cost, achieve a fast update rate, and relinquish the need for internal illumination, which would otherwise add instrumentation complexity.
1.2. Neuromorphic Vision-Based Tactile Sensing
Neuromorphic cameras (also known as event-based cameras) are a relatively new technology, first introduced in [
38], that aim to mimic how the human eye works. Neuromorphic cameras report per-pixel intensity changes in the scene asynchronously, rather than reporting whole frames at a fixed rate. This mode of operation means that event-based cameras exhibit no motion blur. The pixel-wise intensity changes, called events or spikes, are recorded at a temporal resolution on the order of microseconds. Event-based cameras have been applied in autonomous drone racing [
39], space imaging [
40], space exploration [
41], automated drilling [
42], and visual servoing [
43,
44]. Neuromorphic cameras’ fast update rate, along with their high dynamic range (140 dB compared to conventional cameras with 60 dB [
45]) and low power consumption, make them apt for robotics tasks [
46]. Therefore, several studies have proposed the use of neuromorphic event-based cameras for vision-based tactile sensing (VBTS) [
1,
16,
20,
21,
47,
48,
49,
50]. In particular, event-based cameras are capable of providing adequate visual information in challenging lighting conditions without requiring an additional light source, owing to their high dynamic range. Because no source of illumination is needed, a VBTS that utilizes an event-based camera has a lower instrumentation cost and thus requires less maintenance in the long run. Specifically, the instrumentation cost and complexity of a tactile sensor include the cables, the powering circuit, and the maintenance and replacement of defective parts over the sensor’s lifetime; a configuration built around an event-based camera reduces this complexity, requiring fewer cables, a smaller power circuit, and hence less maintenance and fewer replacements of defective parts. While some VBTSs use a semitransparent, transparent, or translucent tactile surface to overcome the need for a source of illumination [
9,
13,
51,
52], this makes training end-to-end machine learning models difficult: the camera captures extraneous information from the environment, so the learned model depends on the object being contacted and on the surroundings, which limits generalization. Event-based cameras allow us to avoid the instrumentation and maintenance costs of a source of illumination while preserving the potential for training end-to-end models. At present, event-based cameras are a new technology that is not yet in mass production, so available cameras cost on the order of thousands of dollars. However, as event cameras gain prominence and enter mass production, the price is expected to decrease significantly over the next five years [
46]. This is exemplified in the consumer-grade mass-produced event-based camera by Samsung, which sells for USD 100 [
46,
53], a price comparable to conventional cameras. These features of event-based cameras make them an attractive choice for VBTS. However, dealing with event-based data still poses a challenge, as will be discussed in the following subsection.
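For readers unfamiliar with the event-generation principle, the following is a minimal, idealized model (our illustration, not a vendor specification): a pixel emits an event of polarity +1 or −1 whenever its log-intensity change since the last event at that pixel exceeds a contrast threshold. The threshold value and the frame-based input are assumptions used purely for illustration.

```python
# Idealized event-generation model: emit (x, y, t, polarity) when the per-pixel
# change in log intensity exceeds the contrast threshold C (an assumed value).
import numpy as np

def events_from_frames(frames, timestamps, C=0.2, eps=1e-6):
    """frames: list of HxW intensity images; returns a list of (x, y, t, polarity)."""
    ref = np.log(frames[0].astype(np.float64) + eps)   # per-pixel reference log intensity
    events = []
    for img, t in zip(frames[1:], timestamps[1:]):
        log_i = np.log(img.astype(np.float64) + eps)
        diff = log_i - ref
        ys, xs = np.where(np.abs(diff) >= C)
        for x, y in zip(xs, ys):
            events.append((x, y, t, 1 if diff[y, x] > 0 else -1))
            ref[y, x] = log_i[y, x]                    # reset the reference where an event fired
    return events
```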
1.3. Challenges in Event-Based Vision and Existing Solutions
The temporally dense, spatially sparse, and asynchronous nature of event-based streams poses a challenge to traditional methods designed for frame-based streams. Early work on neuromorphic vision-based tactile sensing (N-VBTS) constructs images from event streams by accumulating events over a period of time and then applying image-processing techniques. Such approaches are called event-frame methods. These approaches usually apply synchronous algorithms over the constructed frames sequentially; thus, event-frame approaches do not exploit the temporal density and spatial sparsity of event streams. For instance, Amin et al. [
47] detect the incipient slip of a grasped object by applying morphological operations over event-frames and monitoring blobs in the resulting frame. This approach is not asynchronous and does not generalize well to tasks beyond slip detection. Ward-Cherrier et al. [
16] construct encodings based on the positions of the tactile sensor’s markers and then use a classifier to identify the texture of the object in contact. Their algorithm iteratively updates marker positions using events generated around the markers. This method is synchronous and is susceptible to high noise, especially when there is no illumination. Furthermore, under large motions, the estimated marker positions drift away from the actual marker positions. Fariborz et al. [
48,
49] use Conv-LSTMs on event-frames constructed from event streams to estimate contact forces. Faris et al. [
22] use a CNN over accumulated event heatmaps to detect slip. This approach is not asynchronous and hence has downtime between constructed event-frames. To our knowledge, the only asynchronous deep learning method that exploits spatial sparsity and temporal density in the N-VBTS setting is the work of MacDonald et al. [
50].
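A minimal sketch of the event-frame construction used by these methods is given below: events falling inside a time window are binned into a per-polarity count image. The sensor resolution and window bounds are assumptions chosen for illustration.

```python
# Sketch: accumulate events from a time window into a two-channel count image
# (one channel per polarity). The resolution (H, W) is an assumed sensor size.
import numpy as np

def accumulate_event_frame(events, H=260, W=346, t_start=0.0, t_end=0.01):
    """events: iterable of (x, y, t, p); returns an H x W x 2 count image."""
    frame = np.zeros((H, W, 2), dtype=np.float32)
    for x, y, t, p in events:
        if t_start <= t < t_end:
            frame[y, x, 0 if p > 0 else 1] += 1.0
    return frame

# event_frame = accumulate_event_frame(events)  # can then be processed like an ordinary image
```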
Spiking neural networks (SNNs) are computational models inspired by the brain’s neural processes. They utilize event- or clock-driven signals to update neuron nodes based on specific parameters, using discrete spike trains instead of continuous decimal values for information transfer [
54]. This biologically inspired approach offers more intuitive and simpler inference and model training compared with traditional networks [
55]. Building on [
16]’s NeuroTac, the authors propose using an SNN to determine the orientation of contact with an edge. While this is a promising step towards neuromorphic tactile sensing, the SNN is trained in an unsupervised manner, and a separate classifier is run on top of it to make predictions; this approach does not generalize well beyond simple tasks. Furthermore, training SNNs remains challenging due to their non-differentiable nature and their need for larger amounts of training data, owing to the sparsity of spike events. This limitation can restrict their usability in domains with limited data availability. Additionally, SNNs require neuromorphic computing hardware for effective event-based processing [
56,
57].
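As a concrete, toy example of the spiking computation underlying SNNs, the sketch below simulates a single leaky integrate-and-fire neuron; all parameter values are arbitrary and purely illustrative, and real SNNs compose many such units.

```python
# Toy leaky integrate-and-fire (LIF) neuron: the membrane potential leaks towards
# rest, integrates weighted input spikes, and emits a spike when it crosses threshold.
import numpy as np

def lif_neuron(input_spikes, tau=20.0, v_thresh=1.0, v_rest=0.0, dt=1.0, w=0.5):
    """input_spikes: binary array, one entry per time step -> output spike train."""
    v = v_rest
    out = np.zeros(len(input_spikes), dtype=np.int8)
    for i, s in enumerate(input_spikes):
        v += dt * (-(v - v_rest) / tau) + w * s   # leak towards rest plus weighted input
        if v >= v_thresh:                         # fire and reset
            out[i] = 1
            v = v_rest
    return out

# out_spikes = lif_neuron(np.random.binomial(1, 0.3, size=100))
```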
Outside the N-VBTS literature, event-frame and voxel methods also persist [
45,
58,
59,
60,
61]. An emerging line of research investigates the use of graph neural networks (GNNs) to process event streams [
62,
63,
64,
65,
66]. GNNs operate on graphs by learning a representation that takes into account the graph’s connectivity. This representation can be used for further processing via classical machine learning and deep learning methods. GNNs generalize convolutional networks to irregular grids and networks [
67]. By constructing a graph over events from an event-based camera, GNNs can perform spatially sparse and temporally dense convolutions. GNNs can also operate in an asynchronous mode by applying the methods proposed in [
68,
69] to match the nature of event-based streams. This mode of operation ensures that computations occur only when events arrive, in contrast to event-frame methods. The earliest work utilizing GNNs for event streams, [
62], investigates object classification on neuromorphic versions of popular datasets such as Caltech101 and MNIST. Other works also tackle object detection and localization [
62,
63,
68]. Alkendi et al. [
66] use a GNN fed into a transformer for event stream denoising. Furthermore, [
70] shows that GNNs perform well in object detection tasks while requiring considerably fewer floating-point operations per event than CNNs operating on event-frames.
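To make the graph-over-events idea concrete, the hedged sketch below builds a spatiotemporal k-nearest-neighbour graph over raw events with PyTorch Geometric (assuming torch_geometric and its torch-cluster dependency are installed); the temporal scaling factor and the choice of k are our own assumptions rather than values from the cited works.

```python
# Hedged sketch: turn an event stream into a graph whose nodes are events and
# whose edges connect spatiotemporally close events.
import torch
from torch_geometric.data import Data
from torch_geometric.nn import knn_graph

def events_to_graph(events: torch.Tensor, k: int = 8, beta: float = 1e4) -> Data:
    """events: (N, 4) tensor of (x, y, t, p) -> torch_geometric Data object."""
    pos = events[:, :3].clone().float()
    pos[:, 2] *= beta                       # scale time so it is commensurate with pixel distances
    edge_index = knn_graph(pos, k=k)        # connect each event to its k nearest neighbours
    x = events[:, 3:].float()               # polarity as the node feature
    return Data(x=x, pos=pos, edge_index=edge_index)
```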
Graphs inherently do not encode geometric information pertaining to their nodes; they encode only the topological relationships between the nodes and the node features. Accordingly, constructing useful and meaningful representations of event data requires more than just the topological structure of a graph. Thus, it becomes imperative to choose an appropriate message-passing algorithm that encapsulates the geometry of events in order to exploit the spatiotemporal correlations between them. Several geometric graph deep learning methods have been applied to event-based data in the literature. These include the mixture model network (MoNet), graph convolutional networks (GCNs), SplineConv, voxel graph CNNs, and EventConv [
62,
63,
64,
65,
66]. SplineConv has been shown to operate asynchronously on event streams, as proposed by [
70]. Moreover, SplineConv has been shown to perform better and faster than MoNet as demonstrated in [
64]. In addition, SplineConv has been shown to be more expressive than GCNs, which can use only one-dimensional edge features [
71,
72]. In the case of geometric graphs, this feature is usually taken as the distance between nodes. This is problematic for two reasons: (1) messages shared by two equidistant nodes will be indistinguishable, and (2) the messages will be rotation invariant and will hence lose all information about orientation.
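The contrast can be made concrete with a short sketch (ours, with illustrative layer sizes, assuming PyTorch Geometric with its torch-spline-conv extension): a GCN layer aggregates over connectivity alone, whereas SplineConv conditions its kernel on normalized relative offsets (dx, dy, dt), so the direction between events is preserved. The sketch assumes a torch_geometric Data object such as the one produced in the earlier graph-construction sketch.

```python
# Illustrative comparison of edge parameterizations on an event graph.
import torch
from torch_geometric.nn import GCNConv, SplineConv

in_ch, out_ch = 1, 16
gcn = GCNConv(in_ch, out_ch)                              # connectivity only (scalar edge weights at most)
spline = SplineConv(in_ch, out_ch, dim=3, kernel_size=5)  # kernel conditioned on 3D pseudo-coordinates

def compare(data):
    row, col = data.edge_index
    rel = data.pos[col] - data.pos[row]                 # relative (dx, dy, dt) for every edge
    # SplineConv expects pseudo-coordinates normalized to [0, 1].
    pseudo = (rel - rel.min(0).values) / (rel.max(0).values - rel.min(0).values + 1e-9)
    h_gcn = gcn(data.x, data.edge_index)                # rotation-invariant aggregation
    h_spline = spline(data.x, data.edge_index, pseudo)  # direction-aware aggregation
    return h_gcn, h_spline
```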