Article

Extreme Early Image Recognition Using Event-Based Vision

Division of Information and Computing Technology, College of Science and Engineering, Hamad Bin Khalifa University, Doha P.O. Box 34110, Qatar
* Author to whom correspondence should be addressed.
These authors contributed equally to this work.
Sensors 2023, 23(13), 6195; https://doi.org/10.3390/s23136195
Submission received: 30 March 2023 / Revised: 11 May 2023 / Accepted: 11 May 2023 / Published: 6 July 2023
(This article belongs to the Special Issue Image Processing and Analysis for Object Detection)

Abstract

While deep learning algorithms have advanced to a great extent, they are all designed for frame-based imagers that capture images at a high frame rate, which leads to a high storage requirement, heavy computations, and very high power consumption. Unlike frame-based imagers, event-based imagers output asynchronous pixel events without the need for global exposure time, therefore lowering both power consumption and latency. In this paper, we propose an innovative image recognition technique that operates on image events rather than frame-based data, paving the way for a new paradigm of recognizing objects prior to image acquisition. To the best of our knowledge, this is the first time such a concept is introduced featuring not only extreme early image recognition but also reduced computational overhead, storage requirement, and power consumption. Our collected event-based dataset using CeleX imager and five public event-based datasets are used to prove this concept, and the testing metrics reflect how early the neural network (NN) detects an image before the full-frame image is captured. It is demonstrated that, on average for all the datasets, the proposed technique recognizes an image 38.7 ms before the first perfect event and 603.4 ms before the last event is received, which is a reduction of 34% and 69% of the time needed, respectively. Further, less processing is required as the image is recognized 9460 events earlier, which is 37% less than waiting for the first perfectly recognized image. An enhanced NN method is also introduced to reduce this time.

1. Introduction

The rapid development and integration of artificial intelligence with image sensors has revolutionized machine vision. Real-time image recognition is an essential task for many applications, such as emerging self-driving vehicles. These applications require continuous and fast image acquisition combined with computationally intensive machine learning techniques such as image recognition. The use of frame-based cameras in such applications introduces four main challenges including: (1) high bandwidth consumption due to large amounts of data transmission; (2) large memory requirement for data storage prior to processing; (3) computationally expensive algorithms for real-time processing; (4) large power and energy consumption for continuous data transmission and processing.
Unlike frame-based cameras, which sequentially acquire full frames, an event-based imager generates a series of asynchronous events reflecting the change in light intensity per pixel. This concept is derived from the operation of biological vision systems, specifically the retina. The first functional model to simulate the magno cells of the retina was introduced in 2008 by Delbrück's group under the term dynamic vision sensor (DVS) [1]. An event-based imager, also known as a silicon retina, dynamic vision sensor (DVS), or neuromorphic camera, is a biologically inspired vision system that acquires visual information in a different way than conventional cameras. Instead of capturing the absolute brightness of full images at a fixed rate, these imagers asynchronously respond to changes in brightness per pixel. An “event” is generated if the change in brightness at any pixel surpasses a user-defined threshold. The output of the sensor is a series of digital events <(x, y), I, t> (or spikes) that include the pixel’s address (x, y), the time of the event (t), and the sign of the change in brightness (I) [2,3,4,5,6]. Event-based imagers present several major advantages over conventional cameras, including lower latency and power consumption, as well as higher temporal resolution and dynamic range [6]. This allows them to record well in both very dark and very bright scenes.
The novel design of event-based imagers introduces a paradigm shift in the camera sensor design. However, because these sensors produce different data outputs, current image recognition algorithms for conventional cameras are not suitable for them. To our knowledge, there does not yet exist an ideal solution for extracting information from the events produced by the sensors. A few image recognition algorithms have been introduced in the literature, but they are still far from mature [7].
Event-based imagers are capable of overcoming the lost time between frames; hence, they are able to process the information in the “blind” time between each frame. The data collected from event-based imagers have a significantly different structure compared to frame-based imagers. To effectively extract useful information and utilize the full potential of the asynchronous, sparse, and timed data collected from event-based images, we need to either design new processing algorithms or adapt and/or re-design existing vision algorithms for this purpose. The first approach is expected to provide more reliable and accurate results; yet, most existing solutions in the literature follow the second approach. In particular, events are either processed as: (i) a single event, which updates the output of the algorithm based on every new event received, minimizing latency, or (ii) a group of events, which updates the output of the algorithm after a group of events has arrived by using a sliding window. These methodologies are selected based on how much additional information is required to perform a task. In some cases, a single event cannot be useful on its own due to having little information or a lot of noise; hence, additional information is required in the form of either past events or new knowledge.
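To make the two processing modes concrete, the minimal Python sketch below (ours, not from the original paper) dispatches a stream of events either one at a time or in fixed-size groups using a simple buffer; the event fields follow the <(x, y), I, t> format described above, and all function names are illustrative.

from collections import namedtuple

# Illustrative event record: pixel address (x, y), polarity, timestamp in microseconds.
Event = namedtuple("Event", ["x", "y", "polarity", "t"])

def process_per_event(events, update):
    """Single-event mode: the output is refreshed on every incoming event."""
    for ev in events:
        update([ev])            # minimal latency, but each event carries little context

def process_per_group(events, update, group_size=1000):
    """Group mode: events are buffered and the output is refreshed per window."""
    window = []
    for ev in events:
        window.append(ev)
        if len(window) == group_size:
            update(window)      # more context per call, at the cost of extra latency
            window.clear()
    if window:                  # flush any trailing partial window
        update(window)

The group size controls the trade-off: smaller groups approach the latency of single-event processing, while larger groups give the recognition algorithm more context per update.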
Frame-based algorithms have advanced to learn features from data using deep learning. For instance, convolutional neural networks (CNNs) are a mature approach for object detection; therefore, many works utilize CNNs and apply them to event-based data. Such works can be divided into two categories: methods that use a frame-based CNN directly [8,9,10,11] and methods that rewire a CNN to take advantage of the structure of event-based data [11,12]. Recognition can sometimes be applied to events that are transformed to frames during inference [13,14], or by converting a trained neural network to a spiking neural network (SNN) which can operate on the event-based data directly [15,16,17].
In this paper, we propose an innovative image recognition technique that operates on image events rather than frame-based data, paving the way for a new paradigm of recognizing objects prior to image acquisition. Instead of frame-based imagers, event-based imagers are used to achieve on-the-fly object recognition: using the event sequences produced by the imager, objects are recognized before the full-frame image has been accumulated. To the best of our knowledge, apart from our initial work in [18], this is the first time such a concept has been introduced. It not only achieves early image recognition but also addresses all four challenges mentioned previously, since less data are transmitted and processed, enabling faster object recognition for applications with real-time constraints. This work explores dataset acquisition and labeling requirements and methodologies using an event-based imager (CelePixel). It adapts existing frame-based algorithms to implement early image recognition by testing the concept on both publicly available datasets and the dataset collected in this work using CelePixel. It also explores enhancing the algorithms to achieve even earlier image recognition.
The rest of the paper is organized as follows. Section 2 introduces the concept of early recognition. Section 3 explains and analyzes the datasets used to validate our concept. Section 4 describes the proposed early recognition algorithm and testing metrics. Section 5 presents and analyzes our experimental results. Section 6 describes an enhanced early recognition method and presents the results. Finally, Section 7 concludes this work.

2. Early Recognition

The main idea of this work is to utilize event-based imagers to achieve early image recognition, as illustrated in Figure 1. Early recognition is defined as accurate image recognition before the last event is received from the event-based imager; in other words, the ability to detect an object without waiting for the full picture to be captured. The idea is derived from the main feature of event-based imagers, where each pixel fires asynchronously in response to any brightness change. An “event” is generated if the change in brightness at any pixel surpasses a user-defined threshold. The output of the sensor is a series of digital events of the form <(x, y), I, t> that include the pixel’s location (x, y), the time of the event (t), and the sign of the change in brightness (I). We aim to process these events as they arrive in real time to perform recognition, which enables us to process the data faster, as there are no longer frame-rate limitations; meanwhile, redundant background data are also eliminated.
To achieve early recognition, existing frame-based algorithms can be used and adapted to work with event data. The data used to train the algorithm can be normal images (frame-based) or event data. The data need to be pre-processed to match the input of the algorithm selected, which includes operations such as resizing, compression, noise reduction, etc.
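As an illustration of this pre-processing step, the sketch below accumulates a list of events into a two-dimensional frame and resizes it to the input resolution of a frame-based network. The polarity-summing accumulation rule and the use of OpenCV for resizing are our assumptions; the paper does not prescribe a specific accumulation scheme.

import numpy as np
import cv2  # used here only to resize the accumulated frame

def events_to_frame(events, sensor_hw, out_hw=(28, 28)):
    """Accumulate (x, y, polarity, t) events into a 2D frame and resize it to
    the input size expected by a frame-based network."""
    rows, cols = sensor_hw
    frame = np.zeros((rows, cols), dtype=np.float32)
    for x, y, p, _t in events:
        frame[y, x] += p                      # y indexes rows, x indexes columns
    value_range = frame.max() - frame.min()
    if value_range > 0:                       # normalize to [0, 1] for the network
        frame = (frame - frame.min()) / value_range
    # cv2.resize expects (width, height), hence the reversed tuple
    return cv2.resize(frame, out_hw[::-1], interpolation=cv2.INTER_AREA)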
To test the concept, the events are fed to the algorithm as they arrive, and two main metrics are evaluated:
  • First Zero Event (FZE): the first time that the algorithm is able to obtain a zero error rate for the input (this condition starts with the first pixel change and ends before the full image is detected by the sensor).
  • First Perfect Event (FPE): the first time that the algorithm is able to obtain a zero error rate and a confidence level of more than 0.95 (this condition starts after the FZE is detected).
The first zero metric is used to determine the time when the algorithm can guess what the displayed image will be.
In this work, our collected dataset using CeleX (CeleX-MNIST) in Section 3.1 and five different public datasets (MNIST-DVS [19], FLASH-MNIST [19], N-MNIST [20], CIFAR-10 [21], and N-Caltech-101 [20]), collected using different image sensors, are utilized to perform experiments. Two different types of neural networks (InceptionResNetV2 and ResNet50) are trained on the original images (MNIST [22], CIFAR-10 [23], and Caltech-101 [24]), and then tested on the above-mentioned event-based images to demonstrate the ability of early recognition on event-based data. The recognition is then enhanced by training the same neural network on noisy images, referred to in this work as partial pictures (PP).

3. Data Acquisition and Analysis

This section discusses the method used to collect each dataset. Each dataset has different statistical properties that depend on the recording method, the sensor sensitivity, and the data being captured. In this section, we explain how the selected datasets are analyzed to identify their properties and statistical differences, which are summarized in Table 1.
Basic statistical analysis has been performed to identify how long each recording is (in ms) and to obtain the total average for each dataset. This helps in calculating how long each saccade (in ms) takes as part of the recording, whether it is created by sensor or image movement. Further analysis is conducted to calculate the average number of events per recording, which is affected by the image size, the details within each image, and the sensor’s sensitivity. The average numbers of ON and OFF events are also calculated. Finally, the ranges of the x- and y-addresses are computed to make sure that there are no faulty data outside the sensor dimensions.
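The sketch below shows how such per-recording statistics can be computed with NumPy. It assumes the events of one recording are provided as separate x, y, polarity, and timestamp arrays; the ON/OFF test must be adapted to each dataset's polarity convention (for example, N-MNIST encodes OFF/ON as +1/+2 rather than -1/+1).

import numpy as np

def recording_stats(x, y, polarity, t_us):
    """Summarize one event recording: duration (ms), event count, ON/OFF counts,
    and the observed x/y address ranges (used to spot faulty addresses)."""
    return {
        "duration_ms": (t_us.max() - t_us.min()) / 1e3,
        "events": int(len(t_us)),
        "on_events": int(np.sum(polarity > 0)),
        "off_events": int(np.sum(polarity < 0)),
        "x_range": (int(x.min()), int(x.max())),
        "y_range": (int(y.min()), int(y.max())),
    }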
The datasets are arranged from least to most complex. MNIST is considered one of the basic datasets as it only includes numbers. CIFAR10 has more details within each image, and yet only contains 10 classes. Caltech101 is the most complex as it contains a large number of classes and each image has detailed objects and backgrounds.

3.1. CeleX-MNIST

MNIST-CeleX is collected in this work using the CeleX sensor, which is a 1 Mega Pixel (800 Rows × 1280 Columns) event-based sensor designed for machine vision [25]. It can produce three outputs in parallel: motion detection, logarithmic picture, and full-frame optical flow. The output of the sensor can be either a series of events that are produced in an asynchronous manner or synchronous full-frame images. The sensor can be configured in many modes including: event off-pixel timestamp, event in-pixel timestamp, event intensity, full picture, optical flow, or multi-read optical-flow. Moreover, within the modes, the sensor is able to generate different event image frames: event binary pic, event gray pic, event accumulated gray pic, or event count pic. The output data contain different information depending on the mode selected, including: address, off-pixel timestamp, in-pixel timestamp, intensity, polarity, etc.  [26].
To collect the dataset, the CeleX sensor is mounted on a base opposite a computer screen that displays the dataset, as shown in Figure 2. While collecting the dataset, the environment around the imager must be controlled so that no glare or reflection from the screen is introduced. Figure 3 shows the difference between a controlled, well-lit environment and an environment with flickering lights. To avoid artifacts and false pixel changes, the same stable conditions should be used throughout the data collection.
To capture the MNIST data, 600 samples of the MNIST training set were scaled to fit the sensor size and flashed on an LCD screen. As noted in Table 1, both the total time average and the saccade time average of the recordings were 631 ms, as each image was flashed only once. The full sensor size is reflected in the min and max values; however, the image occupies only an estimated 800 × 800 region of the sensor.
Algorithm 1 explains in detail the methodology followed in this work for data acquisition. Once the mode is selected, each image is scaled to 800 × 800, flashed once, and then followed by a black image. Resetting the scene to black allows the sensor to detect the change caused by the flashed image only. The dataset consists of 600 recordings for 10 classes (digits 0–9). Each collected event in a recording includes five fields of pixel information (a structured-array sketch follows the list):
  • Row address: range 0 to 799;
  • Column address: range 0 to 1279;
  • Event timestamp: range 0 to 2^31 in microseconds;
  • Intensity polarity: −1: OFF, 0: unchanged, +1: ON;
  • Intensity: range 0 to 4095.
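A NumPy structured dtype mirroring these five fields is sketched below; the field names are ours and only illustrative of how a recording can be held in memory for analysis.

import numpy as np

# Structured dtype mirroring the five fields listed above (field names are ours).
celex_event_dtype = np.dtype([
    ("row",       np.uint16),   # 0 to 799
    ("col",       np.uint16),   # 0 to 1279
    ("t_us",      np.uint32),   # timestamp in microseconds, 0 to 2^31
    ("polarity",  np.int8),     # -1: OFF, 0: unchanged, +1: ON
    ("intensity", np.uint16),   # 0 to 4095
])

# Example: allocate storage for one recording with an average-sized event count.
recording = np.zeros(420_546, dtype=celex_event_dtype)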

3.2. MNIST-DVS

The dataset [19] was created using a 128 × 128 event-based imager [27]. To obtain the dataset, the original test set (10,000 images) of the MNIST dataset was scaled to three different sizes (4, 8, and 16). Each scaled digit was then displayed slowly on an LCD screen and captured by the imager.
The dataset consists of 30,000 recordings, 10,000 per scale, for 10 classes (digit 0–9). Each collected event in the recording includes four pixel attributes:
  • Row address: range 0 to 127;
  • Column address: range 0 to 127;
  • Event timestamp: in microseconds;
  • Intensity polarity: −1: OFF, +1: ON.
Algorithm 1 CeleX-MNIST dataset acquisition using CelePixel
Sensor Mode: intensity
Picture Mode: gray picture
Input: image matrix
Output: raw events
INITIALIZATION:
  1: load dataset
LOOP PROCESS:
  2: for i = 0 to 599 do
  3:    display black image
  4:    pause for 0.2 s
  5:    start collecting events
  6:    scale image to 800 × 800
  7:    display image
  8:    label event
  9:    pause for 0.2 s
 10:    stop collecting events
 11:    display black image
 12:    pause for 0.2 s
 13:    export collected events
As explained in [19], to capture MNIST-DVS, 10,000 MNIST samples were scaled to three different sizes and displayed slowly on an LCD screen. It can be observed from Table 1 that, as the scale of the image increases, the number of events produced by the imager increases as well, ranging from 17,011 events per image to 103,133 events. However, the recording period is similar for all three scales, with an average of 2347 ms. As the number of saccade movements performed to create the motion is not mentioned in [19], we assume that one recording corresponds to one saccade.

3.3. FLASH-MNIST

The dataset [19] is created using a 128 × 128 event-based imager [27]. To obtain the dataset, each of the 70,000 images in the original MNIST dataset is flashed on a screen. In particular, each digit is flashed five times on an LCD screen and captured by the imager.
The dataset consists of 60,000 training recordings and 10,000 testing recordings, with 10 classes (digits 0–9). Each collected event in the recording includes four pieces of pixel information:
  • Row address: range 1 to 128;
  • Column address: range 1 to 128;
  • Event timestamp: in microseconds;
  • Intensity polarity: 0: OFF, +1: ON.
The FLASH-MNIST dataset is divided into training and testing recordings. The average total time for both sets is 2125 ms. As explained in [19], each image is flashed five times on the screen, and each saccade lasts an average of 425 ms.

3.4. N-MNIST

The dataset [20] is created using a 240 × 180 event-based ATIS imager [28]. To capture the dataset, the imager is mounted on a motorized pan-tilt unit consisting of two motors and positioned in front of an LCD monitor, as shown in Figure 4. To create movement, the imager moves up and down (three micro-saccades), generating motion and capturing the images displayed on the monitor.
The original images were resized, while maintaining the aspect ratio, to ensure that the size does not exceed 240 × 180 pixels (the ATIS size) before being displayed on the screen. The MNIST digits were resized to fill 28 × 28 pixels on the ATIS sensor. The dataset consists of 60,000 training recordings and 10,000 testing recordings for 10 classes (digits 0–9). Each collected event in a recording includes four fields of pixel information:
  • Row address: range 1 to 34;
  • Column address: range 1 to 34;
  • Event timestamp: in microseconds;
  • Intensity polarity: +1: OFF, +2: ON.
The imager used to record N-MNIST [20] is 240 × 180; however, the dataset is recorded only with 34 × 34 pixels. The dataset is divided into training and testing recordings. Each recording contains three saccades, each with a duration of 102 ms, leading to a total recording for a single image of 306 ms. Compared to MNIST-DVS and FLASH-MNIST, this dataset has a very low average number of events considering its small pixel size.

3.5. CIFAR-10

The dataset [21] is created using a 128 × 128 event-based imager [1], as shown in Figure 5A. To capture the dataset, a repeated closed-loop smooth (RCLS) image movement method is used. The recording setup is placed inside a dark compartment and does not require any motors or control circuits. The recording starts with an initialization stage that loads all the data. Then, each loop in the RCLS has four moving paths at an angle of 45 degrees, as shown in Figure 5B, and the full loop is repeated six times. A 2000 ms wait is required between every image so that the next recording is not affected.
The original images were upsampled, while maintaining the aspect ratio, from 32 × 32 to 512 × 512. The dataset consists of 10,000 recordings, as images were randomly selected from the original dataset with 1000 images per class. Each collected event in a recording includes four fields of pixel information:
  • Row address: range 0 to 127;
  • Column address: range 0 to 127;
  • Event timestamp: in microseconds;
  • Intensity polarity: −1: OFF, +1: ON.
In this dataset [21], the recordings are created by moving the image on the screen to four locations and repeating this loop six times; hence, having 24 saccades. Each saccade lasts for 54 ms and adds up to an average total time of 1300 ms per recording. The event count is very high compared to MNIST-DVS which has the same 128 × 128 size. The reason behind the increase in events generated by the imager is that the CIFAR10 dataset has images with complex details and backgrounds, unlike MNIST which only has numbers.

3.6. N-Caltech 101

The dataset [20] was created using a 240 × 180 event-based ATIS imager [28]. The dataset was captured using the same recording technique explained for N-MNIST.
The original images vary in size and aspect ratio. Every image was scaled to be as large as possible, while maintaining the aspect ratio, so that its size did not exceed 240 × 180 pixels (the ATIS size) before being displayed on the screen. The dataset consists of 8709 recordings for 101 classes. Each collected event in a recording includes four fields of pixel information:
  • Row address: range 1 to 180;
  • Column address: range 1 to 240;
  • Event timestamp: in microseconds;
  • Intensity polarity: +1: OFF, +2: ON.
This dataset was recorded using the same method and imager as the N-MNIST. However, for N-Caltech 101 [20], the full imager size is used, as these images are bigger and have more details. Each recording contains three saccades, each with a duration of 100 ms, which creates a total recording for a single image of 300 ms. Due to using the full sensor size and the number of details in these images, it is noticed that the number of events is almost 28 times more than N-MNIST.

4. Early Recognition Method

An existing image recognition algorithm is used with the group-of-events methodology described in Section 1, which waits for a group of events to arrive and then passes the accumulated data to the recognition algorithm. This section describes the network architectures used for early image recognition as well as the testing methodology and metrics.

4.1. Neural Network

Two network architectures are used to process the data.

4.1.1. InceptionResNetV2

The network architecture used is shown in Figure 6 [29]; it consists of a stem block, Inception-ResNet blocks (A, B, and C) with reduction blocks, average pooling, dropout, and fully connected output layers. The InceptionResNetV2 is pre-trained on ImageNet [30], which consists of 1.2 million images. The original network has an output of 1000 classes; hence, a new output layer is added and trained on the original dataset (MNIST or CIFAR-10) to be tested. All inputs are pre-processed by resizing them to match the image size that the network was trained on.

4.1.2. ResNet50

The network architecture used is shown in Figure 7 [31], which consists of convolutional, average pooling, and fully connected output layers. The ResNet50 is pre-trained on ImageNet [30], which consists of 1.2 million images. The weights of the ResNet50 convolutional layers are frozen, while the fully connected output layer is removed. A new output layer is added and trained on the original dataset (Caltech-101) to be tested. All inputs are pre-processed by being resized to match the image size that the network has been trained on.
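Both backbones follow the same transfer-learning recipe: an ImageNet pre-trained feature extractor with a new fully connected output layer trained on the target dataset. The following minimal tf.keras sketch illustrates this recipe for ResNet50 and Caltech-101; the optimizer, loss, and training schedule are placeholders of ours, as they are not specified here, and the InceptionResNetV2 variant is obtained by swapping the backbone class.

import tensorflow as tf

# ResNet50 backbone pre-trained on ImageNet, without its original 1000-class head.
base = tf.keras.applications.ResNet50(weights="imagenet", include_top=False,
                                      input_shape=(224, 224, 3), pooling="avg")
base.trainable = False  # freeze the convolutional layers, as described above

# New fully connected output layer for the 101 Caltech-101 categories.
model = tf.keras.Sequential([
    base,
    tf.keras.layers.Dense(101, activation="softmax"),
])
model.compile(optimizer="adam",                      # placeholder hyperparameters
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
# model.fit(train_images, train_labels, epochs=..., validation_split=0.2)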

4.2. Testing on Collected Events

  • Input Preprocessing
    The size of the collected event frames is not fixed across datasets, and thus each accumulated frame is resized to match the input of the neural network architecture. As the datasets are very large, recognition is conducted after every group of events; note that the timing would be more precise if an event-by-event methodology were used instead of grouped events.
  • Testing and Metrics
    Algorithm 2 describes the testing algorithm. The events are feedforwarded to the neural network model after each group of events, and the two main metrics, FZE and FPE (explained in Section 2), are calculated.
Algorithm 2 Event-based dataset testing on NN architecture
Input: raw events from dataset
Output: FZE, FPE
INITIALIZATION:
  1: load trained weights of network model
  2: load raw events
  3: initialize a matrix with zeros (image)
DATA PREPROCESSING:
  4: resize raw events to match input
LOOP PROCESS:
  5: for i = 1 to events count do
  6:    update the image with event i
  7:    if (i mod #eventPerGroup) = 0 then
  8:       test the image in the neural network
  9:       if class_code is correct and FZE flag = 0 then
 10:          FZE = i
 11:          FZE flag = 1
 12:       if class_code is correct and probability >= 0.95 and FPE flag = 0 then
 13:          FPE = i
 14:          FPE flag = 1
 15: calculate the difference between FZE and FPE
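For reference, the sketch below expresses this testing loop in Python. The predict function stands in for a forward pass through the trained network and is assumed to return a probability vector over the classes; resizing of the accumulated frame to the network input is assumed to happen inside it.

import numpy as np

def find_fze_fpe(events, true_class, predict, sensor_hw,
                 events_per_group=1000, conf_threshold=0.95):
    """Stream events into an accumulating frame and record the first group index
    at which the prediction is correct (FZE) and the first at which it is correct
    with probability >= conf_threshold (FPE)."""
    frame = np.zeros(sensor_hw, dtype=np.float32)
    fze = fpe = None
    for i, (x, y, p, _t) in enumerate(events, start=1):
        frame[y, x] += p
        if i % events_per_group == 0:
            probs = predict(frame)
            pred = int(np.argmax(probs))
            if pred == true_class and fze is None:
                fze = i
            if pred == true_class and probs[pred] >= conf_threshold and fpe is None:
                fpe = i
            if fpe is not None:
                break            # FZE is necessarily found by the time FPE is found
    return fze, fpe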

5. Early Recognition Analysis and Results

5.1. CeleX-MNIST

As stated in Section 4, the InceptionResNetV2 is trained on the original MNIST dataset. To calculate its accuracy, the trained model is tested on 10,000 images of the original MNIST dataset and reports an accuracy of 99.07%.
The 600 raw event-based recordings of the CeleX-MNIST dataset are used. The set contains recordings with events collected over an average time of 631 (ms) and an average event count of 420,546 per recording. The size of the events collected is 800 × 1280 pixels, so each frame is first cropped to 800 × 800 and then resized to 28 × 28 to match the input of the InceptionResNetV2 architecture. As the dataset is very large, recognition is conducted every 1000 events. The results are analyzed as per the testing metrics described in Table 2.
  • Average Results: As shown in Table 3, on average the images are detected 14.81 (ms) earlier, which is around 28.78% before the full image is displayed. In terms of event sequence, the image is detected around 18.92% before, or 32,558 events before the full image is accumulated.
    Table 4 summarizes the testing metrics per image category; only three categories are reported here for reference.
  • Sample test image results: Figure 8 illustrates sample test images at the first zero, first perfect, and saccade events. It can be noticed that at the FZE images, the details of the image are not yet displayed; however, the network can still recognize the image with an average probability of 56.20%. The probability keeps increasing as more events are processed, as illustrated in Figure 9.
    The result of processing a single raw image file from class (2), which is shown in Figure 8 (first row), is discussed here in detail. The selected raw file contains an event count of 742,268 events which are collected at a duration of 479.95 (ms). As the events are processed, the image is updated and feedforwarded to the InceptionResNetV2 trained network. The network predicts the category of the image and provides probability against the 10 classes. Figure 9 illustrates the probability of the image (black line) against the time sequence. As discussed above, as more events are processed, the probability increases. For this image, the network is able to detect the zero and perfect event, as described in Table 5.

5.2. MNIST-DVS

The same trained neural network as in Section 5.1 is used to test this dataset.
The 10,000 (scale 16) raw event-based recordings are selected from the MNIST-DVS dataset. The set contains recordings with events collected over an average time of 2411.81 (ms) and an average event count of 103,144 per recording. The size of the events collected is 128 × 128 pixels, so each frame is resized to 28 × 28 to match the input of the InceptionResNetV2 architecture. As the dataset is very large, recognition is conducted every 50 events. The results are analyzed as per the testing metrics described in Table 2.
  • Average results: As shown in Table 6, on average the images are detected 108.69 (ms) earlier, which is around 35.51% before the full image is displayed. In terms of event sequence, the image is detected around 34.57% earlier, or 3400 events before the full image is accumulated.
    Table 7 summarizes the testing metrics per image category; only three categories are reported here for reference.
  • Sample test image results: Figure 10 illustrates sample test images at the first zero, first perfect, and end of saccade events. It can be noticed that at the FZE images, the details of the image are not yet displayed; however, the network can still recognize the image with an average probability of 61.65%.

5.3. FLASH-MNIST

The same trained neural network as in Section 5.1 is used to test this dataset.
The 10,000 (test dataset) raw event-based recordings are selected from the FLASH-MNIST dataset. The set contains recordings with events collected over an average time of 2103.25 (ms) and an average event count of 27,321 per recording. The size of the events collected is 128 × 128 pixels, so each frame is resized to 28 × 28 to match the input of the InceptionResNetV2 architecture. As the dataset is very large, recognition is conducted every 50 events. The results are analyzed as per the testing metrics described in Table 2.
  • Average results: As shown in Table 8, on average the images are detected 5.76 (ms) earlier, which is around 1.72% before the full image is displayed at the end of the saccade. In terms of event sequence, the image is detected around 8.43% earlier, or 1883 events before the end of the saccade image is accumulated. It is also noted that the average zero time and perfect time are both below 420.65 (ms), which is the duration of the first saccade.
    Table 9 summarizes the testing metrics per image category; only three categories are reported here for reference.
  • Sample test image results: Figure 11 illustrates sample test images at the first zero, first perfect, and end of saccade events. It can be noticed that with the FZE images, the details of the image are not yet displayed; however, the network can still recognize the image with an average probability of 66.80%. The probability keeps increasing as more events are processed, as illustrated in Figure 12.
    The result of processing a single raw image file from class (0), which is shown in Figure 11 (first row), is discussed here in detail. The selected raw file contains an event count of 38,081 events which are collected on a duration of 2095.34 (ms). Each saccade is 419.07 (ms). As the events are processed, the image is updated and feedforwarded to the InceptionResNetV2 trained network. The network predicts the category of the image and provides probability against the 10 classes. Figure 12 illustrates the probability of the image (black line) against the time sequence. As discussed above, as more events are processed, the probability increases. For this image, the network is able to detect the zero and perfect event as described in Table 10.

5.4. N-MNIST

The same trained neural network as in Section 5.1 is used to test this dataset. The 10,000 (test dataset) raw event-based recordings are selected from the N-MNIST dataset. The set contains recordings with events collected over an average time of 306.20 (ms) and an average event count of 4204 per recording. The size of the events collected is 34 × 34 pixels, so each frame is resized to 28 × 28 to match the input of the InceptionResNetV2 architecture. As the dataset is smaller than the previous ones, recognition is conducted every 10 events. The results are analyzed as per the testing metrics described in Table 2.
  • Average results: As shown in Table 11, on average the images are detected 7.89 (ms) earlier, which is around 19.91% before the full image is captured perfectly in FPE. In terms of event sequence, the image is detected around 32.26% earlier, or 167 events before the end of the image is accumulated in FPE. It is also noted that the average zero time and perfect time are both below 102.07 (ms), which is the duration of the first saccade.
    Table 12 summarizes the testing metrics per image category; only three categories are reported here for reference.
  • Sample test image results: Figure 13 illustrates sample test images at the first zero, first perfect, and end of saccade events. It can be noticed that with the FZE images, the details of the image are not yet displayed; however, the network can still recognize the image with an average probability of 67.09%. The probability keeps increasing as more events are processed as illustrated in Figure 14.
    The result of processing a single raw image file from class (3), which is shown in Figure 13 (second row), is discussed here in detail. The selected raw file contains an event count of 5857 events which are collected at a duration of 306.17 (ms). Each saccade is 102.07 (ms). As the events are processed, the image is updated and feedforwarded to the InceptionResNetV2 trained network. The network predicts the category of the image and provides probability against the 10 classes. Figure 14 illustrates the probability of the image (black line) against the time sequence. As discussed above, as more events are processed, the probability increases. For this image, the network is able to detect the zero and perfect event as described in Table 13.

5.5. CIFAR-10

As stated in Section 4, the InceptionResNetV2 is trained on the original CIFAR-10 dataset. To calculate its accuracy, the trained model is tested on 10,000 images of the original CIFAR-10 dataset and reports an accuracy of 90.18%.
The 5509 (test dataset) raw event-based recordings are selected from the CIFAR-10 DVS dataset. The set contains recordings with events collected over an average time of 1300 (ms) and an average event count of 189,145 per recording. The size of the events collected is 128 × 128 pixels, so each frame is resized to 32 × 32 to match the input of the InceptionResNetV2 architecture. As the dataset is very large, recognition is conducted every 100 events. The results are analyzed as per the testing metrics described in Table 2.
  • Average results: As shown in Table 14, on average the images are detected 82.12 (ms) earlier, which is around 66.54% before the full image is captured perfectly in FPE. In terms of event sequence, the image is detected around 60.57% earlier, or 14,239 events before the end of the image is accumulated in FPE.
    Table 15 summarizes the testing metrics per image category; only three categories are reported here for reference.
  • Sample test image results: Figure 15 illustrates sample test images at the first zero, first perfect, and end of saccade events. It can be noticed that at the FZE images, the details of the image are not yet displayed; however, the network can still recognize the image with an average probability of 45.0%. The probability keeps increasing as more events are processed as illustrated in Figure 16.
    The result of processing a single raw image file from class (automobile), which is shown in Figure 15 (second row), is discussed here in detail. The selected raw file contains an event count of 230,283 events which are collected on a duration of 1330 (ms). Each saccade is 55.53 (ms). As the events are processed, the image is updated and feedforwarded to the InceptionResNetV2 trained network. The network predicts the category of the image and provides probability against the 10 classes. Figure 16 illustrates the probability of the image (black line) against the time sequence. As discussed above, as more events are processed, the probability increases. For this image, the network is able to detect the zero and perfect events as described in Table 16. It is also noted that the zero time and perfect time are both below 55.53 (ms), which is the duration of the first saccade.

5.6. N-Caltech 101

As stated in Section 4, the ResNet50 is trained on the original Caltech-101 dataset. To calculate its accuracy, the trained model is tested on 20% of the original Caltech101 dataset and reports an accuracy of 94.87%.
The 1601 raw event-based recordings are selected from the N-CALTECH101 dataset. The set contains recordings with events collected over an average time of 300.14 (ms) and an average event count of 115,298 per recording. The size of the events collected is 240 × 180 pixels, so each frame is resized to 224 × 224 to match the input of the ResNet50 architecture. As the dataset is very large, recognition is conducted every 50 events. The results are analyzed as per the testing metrics described in Table 2.
  • Average results: As shown in Table 17, on average the images are detected 12.58 (ms) earlier, which is around 52.15% before the full image is captured perfectly in FPE. In terms of event sequence, the image is detected around 69.20% earlier, or 6074 events before the end of the image is accumulated in FPE.
    Table 18 summarizes the testing metrics per image category; only three categories are reported here for reference.
  • Sample test image results: Figure 17 illustrates sample test images at the first zero, first perfect, and end of saccade events. It can be noticed that in the FZE images, the details of the image are not yet displayed; however, the network can still recognize the image with an average probability of 20.46%. The probability keeps increasing as more events are processed as illustrated in Figure 18.
The result of processing a single raw image file from class (accordion), which is shown in Figure 17 (first row), is discussed here in detail. The selected raw file contains an event count of 171,982 events which are collected at a duration of 300.33 (ms). Each saccade is 100.11 (ms). As the events are processed, the image is updated and feedforwarded to the ResNet50 trained network. The network predicts the category of the image and provides probability against the 101 classes. Figure 18 illustrates the probability of the image (black line) against the time sequence. As discussed above, as more events are processed, the probability increases. For this image, the network is able to detect the zero and perfect event, as described in Table 19. It is also noted that the zero time and perfect time are both below 100.11 (ms), which is the duration of the first saccade.

6. Enhanced Early Recognition Analysis and Results

The recognition is enhanced by training the InceptionResNetV2 neural network in Section 4 on noisy images, as shown in Figure 19, referred to in this work as partial pictures (PP). To calculate its accuracy, the trained model is tested on 10,000 images of the noised MNIST dataset and reports an accuracy of 98.4%.
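The exact procedure used to generate the partial pictures is not detailed here; the sketch below shows one plausible way to produce them, by randomly keeping only a fraction of the foreground pixels of each original image so that the result resembles a partially accumulated event frame. The keep_fraction parameter and the pixel-dropping scheme are our assumptions.

import numpy as np

def make_partial_picture(image, keep_fraction=0.5, rng=None):
    """Create a 'partial picture' by randomly keeping only a fraction of the
    nonzero (foreground) pixels of a full image."""
    rng = rng or np.random.default_rng()
    partial = np.zeros_like(image)
    ys, xs = np.nonzero(image)
    keep = rng.random(len(ys)) < keep_fraction
    partial[ys[keep], xs[keep]] = image[ys[keep], xs[keep]]
    return partial

# Example: augment the MNIST training set with partial pictures at several levels.
# augmented = [make_partial_picture(img, k) for img in train_images for k in (0.25, 0.5, 0.75)]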
The same setup and input preprocessing techniques mentioned in previous sections are used and results are reported in this section.

6.1. CeleX-MNIST

  • Average results: As shown in Table 20, on average the images are detected 13.85 (ms) earlier, which is around 47.35% before the full image is displayed. In terms of event sequence, the image is detected around 35.87% before, or 46,368 events before the full image is accumulated.
    Table 21 summarizes the testing metrics per image category; only three categories are reported here for reference.

6.2. MNIST-DVS

  • Average results: As shown in Table 22, on average the images are detected 24.47 (ms) earlier, which is around 50.46% before the full image is displayed. In terms of event sequence, the image is detected around 51.42% before, or 865 events before the full image is accumulated.
    Table 23 summarizes the testing metrics per image category; only three categories are reported here for reference.

6.3. FLASH-MNIST

  • Average results: As shown in Table 24, on average the images are detected 24.47 (ms) earlier, which is around 50.46% before the full image is displayed. In terms of event sequence, the image is detected around 51.42% before, or 865 events before the full image is accumulated.
    Table 25 summarizes the testing metrics per image category; only three categories are reported here for reference.

6.4. N-MNIST

  • Average results: As shown in Table 26, on average the images are detected 10.79 (ms) earlier, which is around 22.49% before the full image is displayed. In terms of event sequence, the image is detected around 36.33% before, or 218 events before the full image is accumulated.
    Table 27 summarizes the testing metrics per image category; only three categories are reported here for reference.

7. Conclusions

In this work, fast and early object recognition for real-time applications is explored by using event-based imagers instead of full-frame cameras. The main concept is to recognize an image before it is fully displayed, using event sequences rather than full-frame images. This technique decreases the time and processing power required for object recognition, leading to lower power consumption as well.
First, the InceptionResNetV2 and ResNet50 neural networks are trained on the original MNIST, CIFAR-10, and Caltech101 datasets and then tested on the pre-collected CeleX-MNIST, MNIST-DVS, FLASH-MNIST, N-MNIST, CIFAR-10 DVS, and N-Caltech101 datasets obtained using event-based imagers. The testing metrics are based on calculating how early the network can detect an image before the full-frame image is captured.
As summarized in Table 28, we notice that, on average over all the datasets, we were able to recognize an image 38.7 (ms) earlier, which is a reduction of 34% of the time needed. Less processing is also required, as the image is recognized 9460 events earlier, which is 37% fewer events. These early timings are measured relative to the first perfect event, which is when the algorithm detects an image with a confidence of at least 95%. However, this is not when the last event is received; the last event arrives at the end of the saccade. The time difference between the first zero event and the saccade end is, on average, 603 (ms), excluding CIFAR-10, which did not perform well. In other words, we are able to detect an image 69.1% earlier.
Furthermore, the same neural network architecture is then trained on partial pictures to explore enhancing the early recognition time (FZE) and the saccade difference time, which refers to when the last event is received. This test was only performed on MNIST datasets, and as shown in Table 29, it can be noticed that FZE is reduced and is detected at 104.2 (ms) instead of 160.4 (ms). Moreover, the time difference between the first zero and saccade end is also reduced, and on average is 789.1 (ms); in other words, we are able to detect an image 71.0% earlier.

Author Contributions

Conceptualization, A.A. (Abubakar Abubakar) and A.A. (AlKhzami AlHarami); methodology, A.A. (Abubakar Abubakar) and A.A. (AlKhzami AlHarami); software, A.A. (Abubakar Abubakar) and A.A. (AlKhzami AlHarami); validation, A.A. (Abubakar Abubakar) and A.A. (AlKhzami AlHarami); formal analysis, A.A. (Abubakar Abubakar) and A.A. (AlKhzami AlHarami); writing—original draft preparation, A.A. (Abubakar Abubakar) and A.A. (AlKhzami AlHarami); writing—review and editing, A.A. (Abubakar Abubakar), A.A. (AlKhzami AlHarami), and A.B.; visualization, A.A. (Abubakar Abubakar) and A.A. (AlKhzami AlHarami); supervision, Y.Y. and A.B.; funding acquisition, A.B. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by NPRP under Grant NPRP13S-0212-200345 from the Qatar National Research Fund (a member of Qatar Foundation). The findings herein reflect the work and are solely the responsibility of the authors.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data supporting the reported results were collected by the authors and are available on request from the corresponding author.

Acknowledgments

The authors would like to acknowledge the support provided by Qatar National Library (QNL).

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Lichtsteiner, P.; Posch, C.; Delbruck, T. A 128 × 128 120 dB 15 μs latency asynchronous temporal contrast vision sensor. IEEE J. Solid-State Circuits 2008, 43, 566–576.
  2. Tang, F.; Chen, D.; Wang, B.; Bermak, A. Low-power CMOS image sensor based on column-parallel single-slope SAR quantization scheme. IEEE Trans. Electron Devices 2013, 60, 2561–2566.
  3. Bermak, A.; Yung, Y. A DPS array with programmable resolution and re-configurable conversion time. IEEE Trans. Very Large Scale Integr. Syst. 2006, 14, 15–22.
  4. Law, M.; Bermak, A.; Shi, C. A low-power energy-harvesting logarithmic CMOS image sensor with reconfigurable resolution using two-level quantization scheme. IEEE Trans. Circuits Syst. II 2011, 58, 80–84.
  5. Chen, D.; Matolin, D.; Bermak, A.; Posch, C. Pulse-modulation imaging—Review and performance analysis. IEEE Trans. Biomed. Circuits Syst. 2011, 5, 64–82.
  6. Shoushun, C.; Bermak, A. A low power CMOS imager based on Time-To-First-Spike encoding and fair AER. In Proceedings of the 2005 IEEE International Symposium on Circuits and Systems, Kobe, Japan, 23–26 May 2005.
  7. Jiang, R.; Mou, X.; Shi, S. Object tracking on event cameras with offline-online learning. CAAI Trans. Intell. Technol. 2019, 5, 165–171.
  8. Ghosh, R.; Mishra, A.; Orchard, G.; Thakor, N. Real-time object recognition and orientation estimation using an event-based camera and CNN. In Proceedings of the 2014 IEEE Biomedical Circuits and Systems Conference (BioCAS), Lausanne, Switzerland, 22–24 October 2014; pp. 544–547.
  9. Wang, Y.; Du, B.; Shen, Y.; Wu, K.; Zhao, G.; Sun, J.; Wen, H. EV-Gait: Event-based robust gait recognition using dynamic vision sensors. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 16–20 June 2019; pp. 6351–6360.
  10. Liu, H.; Moeys, D.; Das, G.; Neil, D.; Liu, S.; Delbruck, T. Combined frame- and event-based detection and tracking. In Proceedings of the IEEE International Symposium on Circuits and Systems, Montreal, QC, Canada, 22–25 May 2016; pp. 2511–2514.
  11. Cannici, M.; Ciccone, M.; Romanoni, A.; Matteucci, M. Asynchronous convolutional networks for object detection in neuromorphic cameras. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, Long Beach, CA, USA, 16–20 June 2019.
  12. Li, J.; Shi, F.; Liu, W.; Zou, D.; Wang, Q.; Lee, H.; Park, P.; Ryu, H. Adaptive temporal pooling for object detection using dynamic vision sensor. In Proceedings of the British Machine Vision Conference 2017 (BMVC 2017), London, UK, 4–7 September 2017.
  13. Moeys, D.; Corradi, F.; Kerr, E.; Vance, P.; Das, G.; Neil, D.; Kerr, D.; Delbruck, T. Steering a predator robot using a mixed frame/event-driven convolutional neural network. In Proceedings of the 2016 2nd International Conference on Event-Based Control, Communication, and Signal Processing (EBCCSP), Krakow, Poland, 13–15 June 2016.
  14. Barua, S.; Miyatani, Y.; Veeraraghavan, A. Direct face detection and video reconstruction from event cameras. In Proceedings of the 2016 IEEE Winter Conference on Applications of Computer Vision (WACV 2016), Lake Placid, NY, USA, 7–10 March 2016; pp. 1–9.
  15. Pérez-Carrasco, J.; Zhao, B.; Serrano, C.; Acha, B.; Serrano-Gotarredona, T.; Chen, S.; Linares-Barranco, B. Mapping from frame-driven to frame-free event-driven vision systems by low-rate rate coding and coincidence processing—Application to feedforward ConvNets. IEEE Trans. Pattern Anal. Mach. Intell. 2013, 35, 2706–2719.
  16. Zhu, L.; Wang, X.; Chang, Y.; Li, J.; Huang, T.; Tian, Y. Event-based video reconstruction via potential-assisted spiking neural network. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 19–24 June 2022; pp. 3584–3594.
  17. Ceolini, E.; Frenkel, C.; Shrestha, S.; Taverni, G.; Khacef, L.; Payvand, M.; Donati, E. Hand-gesture recognition based on EMG and event-based camera sensor fusion: A benchmark in neuromorphic computing. Front. Neurosci. 2020, 14, 637.
  18. Alharami, A.; Yang, Y.; Althani, D.; Shoushun, C.; Bermak, A. Early image detection using event-based vision. In Proceedings of the 2020 IEEE International Conference on Informatics, IoT, and Enabling Technologies (ICIoT), Doha, Qatar, 2–5 February 2020; pp. 146–149.
  19. Yousefzadeh, A.; Serrano-Gotarredona, T.; Linares-Barranco, B. MNIST-DVS and FLASH-MNIST-DVS Databases. Instituto de Microelectrónica de Sevilla. Available online: http://www2.imse-cnm.csic.es/caviar/MNISTDVS.html (accessed on 20 March 2021).
  20. Orchard, G.; Jayawant, A.; Cohen, G.; Thakor, N. Converting static image datasets to spiking neuromorphic datasets using saccades. Front. Neurosci. 2015, 9, 437.
  21. Li, H.; Liu, H.; Ji, X.; Li, G.; Shi, L. CIFAR10-DVS: An event-stream dataset for object classification. Front. Neurosci. 2017, 11, 309.
  22. LeCun, Y.; Bottou, L.; Bengio, Y.; Haffner, P. Gradient-based learning applied to document recognition. Proc. IEEE 1998, 86, 2278–2324.
  23. Krizhevsky, A. Learning Multiple Layers of Features from Tiny Images; Technical Report; University of Toronto: Toronto, ON, Canada, 2009.
  24. Fei-Fei, L.; Fergus, R.; Perona, P. Learning generative visual models from few training examples: An incremental Bayesian approach tested on 101 object categories. Comput. Vis. Image Underst. 2007, 106, 59–70.
  25. Shoushun, C. Pixel Acquisition Circuit, Image Sensor and Image Acquisition System; CelePixel Technology Co., Ltd.: Shanghai, China, 2019.
  26. CelePixel Technology. CelePixel CeleX-5 Chipset SDK Reference. Available online: https://github.com/CelePixel/CeleX5-MIPI/tree/master/Documentation (accessed on 15 October 2020).
  27. Serrano-Gotarredona, T.; Linares-Barranco, B. A 128 × 128 1.5% contrast sensitivity 0.9% FPN 3 μs latency 4 mW asynchronous frame-free dynamic vision sensor using transimpedance preamplifiers. IEEE J. Solid-State Circuits 2013, 48, 827–838.
  28. Posch, C.; Matolin, D.; Wohlgenannt, R. A QVGA 143 dB dynamic range asynchronous address-event PWM dynamic image sensor with lossless pixel-level video compression. In Proceedings of the 2010 IEEE International Solid-State Circuits Conference (ISSCC), San Francisco, CA, USA, 7–11 February 2010.
  29. Szegedy, C.; Ioffe, S.; Vanhoucke, V.; Alemi, A. Inception-v4, Inception-ResNet and the impact of residual connections on learning. In Proceedings of the 31st AAAI Conference on Artificial Intelligence (AAAI 2017), San Francisco, CA, USA, 4–9 February 2017; pp. 4278–4284.
  30. Russakovsky, O.; Deng, J.; Su, H.; Krause, J.; Satheesh, S.; Ma, S.; Huang, Z.; Karpathy, A.; Khosla, A.; Bernstein, M.; et al. ImageNet Large Scale Visual Recognition Challenge. Int. J. Comput. Vis. 2015, 115, 211–252.
  31. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
Figure 1. Concept of early image recognition.
Figure 2. CelePixel data acquisition setup.
Figure 3. CelePixel data acquisition conditions: (a) regular conditions; (b) with flickering light.
Figure 4. (A) ATIS mounted on a motorized pan-tilt unit. (B) ATIS positioned in front of an LCD monitor [20].
Figure 5. (A) Recording setup of sensor positioned in front of an LCD monitor. (B) Image movement sequence on screen [21].
Figure 6. Original InceptionResNetV2 architecture.
Figure 7. Original ResNet50 architecture.
Figure 8. Sample CeleX-MNIST test images at event (left) FZE, (middle) FPE, (right) saccade end.
Figure 9. Testing result of CeleX-MNIST class 2 image using time sequence.
Figure 10. Sample MNIST-DVS test images at event (left) FZE, (middle) FPE, (right) saccade end.
Figure 11. Sample FLASH-MNIST test images at event (left) FZE, (middle) FPE, (right) saccade end.
Figure 12. Testing result of FLASH-MNIST class 0 image using time sequence.
Figure 13. Sample N-MNIST test images at event (left) FZE, (middle) FPE, (right) saccade end.
Figure 14. Testing result of N-MNIST class 3 image using time sequence.
Figure 15. Sample CIFAR-10 DVS test images at event (left) FZE, (middle) FPE, (right) saccade end. Row (1) left: airplane, right: cat. Row (2) left: automobile, right: dog.
Figure 16. Testing result of CIFAR-10 DVS class (automobile) image using time sequence.
Figure 17. Sample N-CALTECH101 test images at event (left) FZE, (middle) FPE, (right) saccade end. Row (1) left: accordion, right: ant. Row (2) left: anchor, right: chair.
Figure 18. Testing result of N-CALTECH101 class (accordion) image using time sequence.
Figure 19. Sample of noised original MNIST dataset (partial pictures).
Table 1. Dataset statistics results.
Dataset | Avg Total T (ms) | Avg Events | Avg ON E | Avg OFF E | Avg Saccade T (ms) | Min X | Min Y | Max X | Max Y
CeleX-MNIST [this work] | 631 | 420,546 | 268,129 | 152,417 | 631 | 0 | 0 | 1279 | 799
MNIST-DVS (Scale 4) [19] | 2261 | 17,011 | 8662 | 8349 | 2261.11 | 0 | 0 | 127 | 127
MNIST-DVS (Scale 8) [19] | 2371 | 43,764 | 21,841 | 21,922 | 2370.70 | 0 | 0 | 127 | 127
MNIST-DVS (Scale 16) [19] | 2412 | 103,144 | 50,985 | 52,158 | 2411.81 | 0 | 0 | 127 | 127
FLASH-MNIST (Test) [19] | 2103 | 27,321 | 16,410 | 10,910 | 420.65 | 1 | 1 | 128 | 128
FLASH-MNIST (Train) [19] | 2147 | 26,713 | 16,018 | 10,694 | 429.35 | 1 | 1 | 128 | 128
N-MNIST (Test) [20] | 306 | 4204 | 2116 | 2087 | 102.07 | 1 | 1 | 34 | 34
N-MNIST (Train) [20] | 307 | 4172 | 2088 | 2084 | 102.17 | 1 | 1 | 34 | 34
CIFAR-10 [21] | 1300 | 183,145 | 76,870 | 106,276 | 54.19 | 0 | 0 | 127 | 127
N-Caltech 101 [20] | 300 | 115,298 | 58,289 | 57,009 | 100.05 | 1 | 1 | 233 | 173
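As a rough illustration of how the per-dataset figures in Table 1 can be obtained, the sketch below scans a single recording's event list and accumulates the quantities in the table; dataset-level values are then averages of these per-recording results. The (x, y, polarity, timestamp) column layout, the polarity encoding, and the microsecond timestamps are assumptions of this sketch rather than a fixed file format.

```python
import numpy as np

def recording_stats(events: np.ndarray) -> dict:
    """Per-recording statistics for an event stream <(x, y), I, t>.

    Assumptions of this sketch: `events` is an (N, 4) array with columns
    (x, y, polarity, timestamp), polarity > 0 for ON events and <= 0 for
    OFF events, and timestamps given in microseconds.
    """
    x, y, pol, t = events[:, 0], events[:, 1], events[:, 2], events[:, 3]
    return {
        "total_time_ms": float(t.max() - t.min()) / 1e3,  # recording duration
        "events": int(len(events)),                       # total event count
        "on_events": int(np.count_nonzero(pol > 0)),
        "off_events": int(np.count_nonzero(pol <= 0)),
        "min_x": int(x.min()), "min_y": int(y.min()),
        "max_x": int(x.max()), "max_y": int(y.max()),
    }

# Dataset-level entries are then simple means over all recordings, e.g.:
# avg_events = np.mean([recording_stats(r)["events"] for r in recordings])
```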
Table 2. Testing metrics.
Zero_Time: Average time of FZE (ms).
Zero_Event: Average event of FZE.
Zero_Prob: Average probability of FZE.
Perfect_Time: Average time of FPE (ms).
Perfect_Event: Average event of FPE.
Perfect_Prob: Average probability of FPE.
Time_Diff: Average time difference between FPE and FZE, which determines how early an image is detected in terms of time (ms).
Time %: Average time difference percentage between FPE and FZE, which determines how early an image is detected in terms of time.
Event_Diff: Average event difference between FPE and FZE, which determines how early an image is detected in terms of event number.
Event %: Average event difference percentage between FPE and FZE, which determines how early an image is detected in terms of event number.
Sacc_Time_Diff: Average time difference between saccade end and FZE, which determines how early an image is detected in terms of time (ms).
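Once the FZE and FPE have been located for each test image, the Table 2 metrics reduce to simple averaging. The sketch below aggregates assumed per-image records into the reported quantities; the record keys, and the use of the FPE averages as denominators for the percentage columns, are assumptions inferred from how the result tables read, not the authors' exact computation (for instance, how images without a perfect event are handled is not specified here).

```python
from statistics import mean

def summarize(per_image: list[dict]) -> dict:
    """Aggregate per-image FZE/FPE measurements into the Table 2 metrics.

    Each record is assumed to carry: 'zero_time', 'zero_event', 'zero_prob'
    (measured at the FZE), 'perfect_time', 'perfect_event', 'perfect_prob'
    (measured at the FPE), and 'saccade_end' (saccade end time, ms).
    """
    perfect_time = mean(r["perfect_time"] for r in per_image)
    perfect_event = mean(r["perfect_event"] for r in per_image)
    time_diff = mean(r["perfect_time"] - r["zero_time"] for r in per_image)
    event_diff = mean(r["perfect_event"] - r["zero_event"] for r in per_image)
    sacc_diff = mean(r["saccade_end"] - r["zero_time"] for r in per_image)
    sacc_end = mean(r["saccade_end"] for r in per_image)
    return {
        "Zero_Time": mean(r["zero_time"] for r in per_image),
        "Zero_Event": mean(r["zero_event"] for r in per_image),
        "Zero_Prob": mean(r["zero_prob"] for r in per_image),
        "Perfect_Time": perfect_time,
        "Perfect_Event": perfect_event,
        "Perfect_Prob": mean(r["perfect_prob"] for r in per_image),
        "Time_Diff": time_diff,
        "Time %": 100.0 * time_diff / perfect_time,
        "Event_Diff": event_diff,
        "Event %": 100.0 * event_diff / perfect_event,
        "Sacc_Time_Diff": sacc_diff,
        "Sacc %": 100.0 * sacc_diff / sacc_end,
    }
```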
Table 3. Average results for CeleX-MNIST test images.
Zero_Time: 39.69 ms | Zero_Prob: 56.20%
Perfect_Time: 51.46 ms | Perfect_Prob: 78.72%
Zero_Event: 153,598 events | Perfect_Event: 172,121 events
Time_Diff: 14.81 ms | Time %: 28.78%
Event_Diff: 32,558 events | Event %: 18.92%
Sacc_Diff: 613.73 ms | Sacc %: 93.98%

Table 4. Average results for (3) categories of CeleX-MNIST test images.
Class 3 (30 images)
Zero_Time: 51.81 ms | Zero_Prob: 47.72%
Perfect_Time: 64.18 ms | Perfect_Prob: 74.17%
Zero_Event: 198,633 events | Perfect_Event: 222,683 events
Time_Diff: 17.00 ms | Time %: 26.48%
Event_Diff: 36,016 events | Event %: 16.17%
Sacc_Diff: 621.71 ms | Sacc %: 92.31%
Class 6 (30 images)
Zero_Time: 65.37 ms | Zero_Prob: 50.95%
Perfect_Time: 80.19 ms | Perfect_Prob: 73.17%
Zero_Event: 189,450 events | Perfect_Event: 213,350 events
Time_Diff: 14.97 ms | Time %: 18.66%
Event_Diff: 29,550 events | Event %: 13.85%
Sacc_Diff: 636.20 ms | Sacc %: 90.68%
Class 9 (30 images)
Zero_Time: 38.11 ms | Zero_Prob: 49.71%
Perfect_Time: 54.68 ms | Perfect_Prob: 80.85%
Zero_Event: 158,433 events | Perfect_Event: 185,833 events
Time_Diff: 20.46 ms | Time %: 37.42%
Event_Diff: 40,066 events | Event %: 21.56%
Sacc_Diff: 557.90 ms | Sacc %: 93.61%

Table 5. Average results for CeleX-MNIST class 2 test image.
Zero_Time: 3.80 ms | Zero_Prob: 55.77%
Perfect_Time: 101.57 ms | Perfect_Prob: 96.65%
Zero_Event: 221,000 events | Perfect_Event: 505,000 events
Time_Diff: 97.7 ms | Time %: 96.26%
Event_Diff: 284,000 events | Event %: 56.24%
Table 6. Average results for MNIST-DVS test images.
Zero_Time: 235.45 ms | Zero_Prob: 61.65%
Perfect_Time: 306.11 ms | Perfect_Prob: 87.22%
Zero_Event: 7616 events | Perfect_Event: 9834 events
Time_Diff: 108.69 ms | Time %: 35.51%
Event_Diff: 3400 events | Event %: 34.57%
Sacc_Diff: 2176.37 ms | Sacc %: 90.24%

Table 7. Average results for (3) categories of MNIST-DVS test images.
Class 2 (1000 images)
Zero_Time: 546.72 ms | Zero_Prob: 47.71%
Perfect_Time: 578.64 ms | Perfect_Prob: 82.70%
Zero_Event: 17,582 events | Perfect_Event: 18,303 events
Time_Diff: 101.3 ms | Time %: 17.51%
Event_Diff: 1865 events | Event %: 16.31%
Sacc_Diff: 1865.09 ms | Sacc %: 77.33%
Class 4 (1000 images)
Zero_Time: 188.48 ms | Zero_Prob: 65.67%
Perfect_Time: 322.13 ms | Perfect_Prob: 78.24%
Zero_Event: 5185 events | Perfect_Event: 8546 events
Time_Diff: 179.34 ms | Time %: 55.67%
Event_Diff: 4633 events | Event %: 54.21%
Sacc_Diff: 2223.33 ms | Sacc %: 92.19%
Class 6 (1000 images)
Zero_Time: 452.23 ms | Zero_Prob: 50.70%
Perfect_Time: 515.07 ms | Perfect_Prob: 79.66%
Zero_Event: 12,301 events | Perfect_Event: 13,844 events
Time_Diff: 146.05 ms | Time %: 28.35%
Event_Diff: 3650 events | Event %: 26.36%
Sacc_Diff: 1959.59 ms | Sacc %: 81.25%
Table 8. Average results for FLASH-MNIST test images.
Zero_Time: 336.79 ms | Zero_Prob: 66.80%
Perfect_Time: 335.82 ms | Perfect_Prob: 96.00%
Zero_Event: 3581 events | Perfect_Event: 3842 events
Time_Diff: 5.76 ms | Time %: 1.72%
Event_Diff: 324 events | Event %: 8.43%
Sacc_Diff: 83.86 ms | Sacc %: 19.94%

Table 9. Average results for (3) categories of FLASH-MNIST test images.
Class 0 (980 images)
Zero_Time: 372.41 ms | Zero_Prob: 63.15%
Perfect_Time: 375.58 ms | Perfect_Prob: 97.64%
Zero_Event: 4402 events | Perfect_Event: 4860 events
Time_Diff: 3.17 ms | Time %: 0.84%
Event_Diff: 458 events | Event %: 9.41%
Sacc_Diff: 51.60 ms | Sacc %: 12.20%
Class 3 (1010 images)
Zero_Time: 374.31 ms | Zero_Prob: 60.56%
Perfect_Time: 372.17 ms | Perfect_Prob: 96.87%
Zero_Event: 4229 events | Perfect_Event: 4532 events
Time_Diff: 2.36 ms | Time %: 0.63%
Event_Diff: 329 events | Event %: 7.27%
Sacc_Diff: 52.80 ms | Sacc %: 12.60%
Class 7 (1028 images)
Zero_Time: 369.08 ms | Zero_Prob: 65.89%
Perfect_Time: 366.89 ms | Perfect_Prob: 93.78%
Zero_Event: 3030 events | Perfect_Event: 3313 events
Time_Diff: 7.29 ms | Time %: 1.99%
Event_Diff: 350 events | Event %: 10.57%
Sacc_Diff: 42.62 ms | Sacc %: 10.14%

Table 10. Average results for FLASH-MNIST class 0 test image.
Zero_Time: 370.63 ms | Zero_Prob: 26.64%
Perfect_Time: 385.58 ms | Perfect_Prob: 98.15%
Zero_Event: 4200 events | Perfect_Event: 6400 events
Time_Diff: 14.95 ms | Time %: 3.88%
Event_Diff: 2200 events | Event %: 34.38%
Table 11. Average results for N-MNIST test images.
Zero_Time: 32.47 ms | Zero_Prob: 67.09%
Perfect_Time: 39.65 ms | Perfect_Prob: 96.40%
Zero_Event: 358 events | Perfect_Event: 516 events
Time_Diff: 7.89 ms | Time %: 19.91%
Event_Diff: 167 events | Event %: 32.26%
Sacc_Diff: 69.60 ms | Sacc %: 68.18%

Table 12. Average results for (3) categories of N-MNIST test images.
Class 5 (892 images)
Zero_Time: 25.17 ms | Zero_Prob: 66.09%
Perfect_Time: 29.88 ms | Perfect_Prob: 97.31%
Zero_Event: 207 events | Perfect_Event: 313 events
Time_Diff: 4.74 ms | Time %: 15.86%
Event_Diff: 107 events | Event %: 34.07%
Sacc_Diff: 76.90 ms | Sacc %: 75.34%
Class 8 (974 images)
Zero_Time: 34.89 ms | Zero_Prob: 67.58%
Perfect_Time: 44.24 ms | Perfect_Prob: 96.78%
Zero_Event: 396 events | Perfect_Event: 598 events
Time_Diff: 9.96 ms | Time %: 22.51%
Event_Diff: 210 events | Event %: 35.11%
Sacc_Diff: 67.18 ms | Sacc %: 65.81%
Class 9 (1009 images)
Zero_Time: 31.75 ms | Zero_Prob: 60.02%
Perfect_Time: 36.82 ms | Perfect_Prob: 95.01%
Zero_Event: 252 events | Perfect_Event: 357 events
Time_Diff: 6.33 ms | Time %: 17.19%
Event_Diff: 116 events | Event %: 32.36%
Sacc_Diff: 70.32 ms | Sacc %: 68.90%

Table 13. Average results for N-MNIST class 3 test image.
Zero_Time: 39.34 ms | Zero_Prob: 86.44%
Perfect_Time: 96.79 ms | Perfect_Prob: 99.88%
Zero_Event: 700 events | Perfect_Event: 1950 events
Time_Diff: 57.45 ms | Time %: 59.36%
Event_Diff: 1250 events | Event %: 64.10%
Table 14. Average results for CIFAR-10 DVS test images.
Zero_Time: 88.96 ms | Zero_Prob: 44.99%
Perfect_Time: 123.42 ms | Perfect_Prob: 58.45%
Zero_Event: 18,519 events | Perfect_Event: 23,507 events
Time_Diff: 82.12 ms | Time %: 66.54%
Event_Diff: 14,239 events | Event %: 60.57%

Table 15. Average results for (3) categories of CIFAR-10 DVS test images.
Class Airplane (1000 images)
Zero_Time: 11.88 ms | Zero_Prob: 54.40%
Perfect_Time: 43.97 ms | Perfect_Prob: 92.69%
Zero_Event: 3960 events | Perfect_Event: 9679 events
Time_Diff: 33.77 ms | Time %: 72.98%
Event_Diff: 6062 events | Event %: 59.09%
Class Bird (380 images)
Zero_Time: 129.43 ms | Zero_Prob: 43.95%
Perfect_Time: 155.67 ms | Perfect_Prob: 45.78%
Zero_Event: 26,341 events | Perfect_Event: 26,521 events
Time_Diff: 32.09 ms | Time %: 16.86%
Event_Diff: 180 events | Event %: 0.68%
Class Cat (1000 images)
Zero_Time: 93.39 ms | Zero_Prob: 52.39%
Perfect_Time: 180.91 ms | Perfect_Prob: 65.54%
Zero_Event: 22,720 events | Perfect_Event: 35,133 events
Time_Diff: 87.52 ms | Time %: 48.38%
Event_Diff: 12,414 events | Event %: 35.33%

Table 16. Average results for CIFAR-10 DVS class (automobile) test image.
Zero_Time: 2.77 ms | Zero_Prob: 30.01%
Perfect_Time: 48.08 ms | Perfect_Prob: 95.71%
Zero_Event: 1700 events | Perfect_Event: 7800 events
Time_Diff: 45.31 ms | Time %: 94.25%
Event_Diff: 6100 events | Event %: 78.21%
Table 17. Average results for N-CALTECH101 test images.
Zero_Time: 26.60 ms | Zero_Prob: 20.46%
Perfect_Time: 24.11 ms | Perfect_Prob: 45.46%
Zero_Event: 7701 events | Perfect_Event: 8778 events
Time_Diff: 12.58 ms | Time %: 52.15%
Event_Diff: 6074 events | Event %: 69.20%
Sacc_Diff: 73.45 ms | Sacc %: 73.41%

Table 18. Average results for (3) categories of N-CALTECH101 test images.
Class menorah (87 images)
Zero_Time: 18.17 ms | Zero_Prob: 18.33%
Perfect_Time: 427.92 ms | Perfect_Prob: 92.45%
Zero_Event: 3836 events | Perfect_Event: 8069.54 events
Time_Diff: 11.24 ms | Time %: 40.25%
Event_Diff: 5252 events | Event %: 65.09%
Sacc_Diff: 81.88 ms | Sacc %: 81.84%
Class stop_sign (64 images)
Zero_Time: 45.42 ms | Zero_Prob: 22.70%
Perfect_Time: 50.11 ms | Perfect_Prob: 52.7%
Zero_Event: 15,113 events | Perfect_Event: 19,893 events
Time_Diff: 27.71 ms | Time %: 55.31%
Event_Diff: 11,881 events | Event %: 59.73%
Sacc_Diff: 54.63 ms | Sacc %: 54.60%
Class yin_yang (60 images)
Zero_Time: 15.96 ms | Zero_Prob: 21.60%
Perfect_Time: 38.39 ms | Perfect_Prob: 70.39%
Zero_Event: 3200 events | Perfect_Event: 9685 events
Time_Diff: 28.32 ms | Time %: 73.76%
Event_Diff: 8649 events | Event %: 89.30%
Sacc_Diff: 84.09 ms | Sacc %: 84.05%

Table 19. Average results for N-CALTECH101 class (accordion) test image.
Zero_Time: 11.85 ms | Zero_Prob: 22.32%
Perfect_Time: 39.51 ms | Perfect_Prob: 95.50%
Zero_Event: 1650 events | Perfect_Event: 18,100 events
Time_Diff: 27.65 ms | Time %: 70.00%
Event_Diff: 16,450 events | Event %: 90.88%
Table 20. Average PP results for CeleX-MNIST test images.
Zero_Time: 18.17 ms | Zero_Prob: 58.45%
Perfect_Time: 28.64 ms | Perfect_Prob: 81.29%
Zero_Event: 104,198 events | Perfect_Event: 133,067 events
Time_Diff: 13.85 ms | Time %: 47.35%
Event_Diff: 46,368 events | Event %: 35.87%
Sacc_Diff: 635.24 ms | Sacc %: 97.23%

Table 21. Average PP results for (3) categories of CeleX-MNIST test images.
Class 3 (30 images)
Zero_Time: 21.21 ms | Zero_Prob: 56.17%
Perfect_Time: 25.50 ms | Perfect_Prob: 81.07%
Zero_Event: 127,317 events | Perfect_Event: 141,867 events
Time_Diff: 8.75 ms | Time %: 34.31%
Event_Diff: 38,983 events | Event %: 27.48%
Sacc_Diff: 652.31 ms | Sacc %: 96.85%
Class 6 (30 images)
Zero_Time: 25.95 ms | Zero_Prob: 49.57%
Perfect_Time: 44.33 ms | Perfect_Prob: 85.98%
Zero_Event: 106,300 events | Perfect_Event: 164,700 events
Time_Diff: 21.04 ms | Time %: 47.47%
Event_Diff: 74,933 events | Event %: 45.50%
Sacc_Diff: 675.63 ms | Sacc %: 96.30%
Class 9 (30 images)
Zero_Time: 4.89 ms | Zero_Prob: 66.06%
Perfect_Time: 20.62 ms | Perfect_Prob: 95.64%
Zero_Event: 34,817 events | Perfect_Event: 103,483 events
Time_Diff: 15.73 ms | Time %: 76.28%
Event_Diff: 68,717 events | Event %: 66.40%
Sacc_Diff: 591.12 ms | Sacc %: 99.18%
Table 22. Average PP results for MNIST-DVS test images.
Zero_Time: 29.85 ms | Zero_Prob: 62.87%
Perfect_Time: 43.81 ms | Perfect_Prob: 79.49%
Zero_Event: 998 events | Perfect_Event: 1495 events
Time_Diff: 24.47 ms | Time %: 50.46%
Event_Diff: 865 events | Event %: 51.42%
Sacc_Diff: 2366.45 ms | Sacc %: 98.76%

Table 23. Average PP results for (3) categories of MNIST-DVS test images.
Class 2 (1000 images)
Zero_Time: 76.47 ms | Zero_Prob: 44.14%
Perfect_Time: 91.50 ms | Perfect_Prob: 61.66%
Zero_Event: 2308 events | Perfect_Event: 2696 events
Time_Diff: 48.53 ms | Time %: 53.04%
Event_Diff: 1455 events | Event %: 53.99%
Sacc_Diff: 2313.96 ms | Sacc %: 96.80%
Class 4 (1000 images)
Zero_Time: 23.18 ms | Zero_Prob: 72.23%
Perfect_Time: 36.18 ms | Perfect_Prob: 91.88%
Zero_Event: 610 events | Perfect_Event: 961 events
Time_Diff: 16.44 ms | Time %: 45.43%
Event_Diff: 438 events | Event %: 45.63%
Sacc_Diff: 2364.61 ms | Sacc %: 99.03%
Class 6 (1000 images)
Zero_Time: 21.75 ms | Zero_Prob: 59.56%
Perfect_Time: 34.78 ms | Perfect_Prob: 75.12%
Zero_Event: 547 events | Perfect_Event: 896 events
Time_Diff: 20.09 ms | Time %: 57.76%
Event_Diff: 527 events | Event %: 58.85%
Sacc_Diff: 2364.90 ms | Sacc %: 99.09%
Table 24. Average PP results for FLASH-MNIST test images.
Zero_Time: 335.37 ms | Zero_Prob: 72.52%
Perfect_Time: 338.37 ms | Perfect_Prob: 97.33%
Zero_Event: 3460 events | Perfect_Event: 3820 events
Time_Diff: 5.73 ms | Time %: 1.51%
Event_Diff: 391 events | Event %: 9.33%
Sacc_Diff: 83.86 ms | Sacc %: 19.94%

Table 25. Average PP results for (3) categories of FLASH-MNIST test images.
Class 0 (980 images)
Zero_Time: 371.17 ms | Zero_Prob: 61.54%
Perfect_Time: 374.80 ms | Perfect_Prob: 97.49%
Zero_Event: 4862 events | Perfect_Event: 5516 events
Time_Diff: 3.63 ms | Time %: 0.97%
Event_Diff: 654 events | Event %: 11.85%
Sacc_Diff: 51.60 ms | Sacc %: 12.20%
Class 3 (1010 images)
Zero_Time: 366.30 ms | Zero_Prob: 69.26%
Perfect_Time: 371.96 ms | Perfect_Prob: 97.55%
Zero_Event: 4723 events | Perfect_Event: 5059 events
Time_Diff: 7.21 ms | Time %: 1.94%
Event_Diff: 370 events | Event %: 7.31%
Sacc_Diff: 52.80 ms | Sacc %: 12.60%
Class 7 (1028 images)
Zero_Time: 377.81 ms | Zero_Prob: 68.64%
Perfect_Time: 384.87 ms | Perfect_Prob: 95.39%
Zero_Event: 3368 events | Perfect_Event: 3991 events
Time_Diff: 16.79 ms | Time %: 4.36%
Event_Diff: 718 events | Event %: 17.98%
Sacc_Diff: 42.62 ms | Sacc %: 10.14%
Table 26. Average PP results for N-MNIST test images.
Zero_Time: 33.33 ms | Zero_Prob: 71.49%
Perfect_Time: 43.28 ms | Perfect_Prob: 96.38%
Zero_Event: 383 events | Perfect_Event: 592 events
Time_Diff: 10.79 ms | Time %: 22.49%
Event_Diff: 218 events | Event %: 36.33%
Sacc_Diff: 69.11 ms | Sacc %: 67.46%

Table 27. Average PP results for (3) categories of N-MNIST test images.
Class 5 (892 images)
Zero_Time: 35.02 ms | Zero_Prob: 69.90%
Perfect_Time: 47.41 ms | Perfect_Prob: 97.13%
Zero_Event: 476 events | Perfect_Event: 765 events
Time_Diff: 12.92 ms | Time %: 27.24%
Event_Diff: 298 events | Event %: 38.92%
Sacc_Diff: 67.59 ms | Sacc %: 65.87%
Class 8 (974 images)
Zero_Time: 49.57 ms | Zero_Prob: 71.91%
Perfect_Time: 59.69 ms | Perfect_Prob: 96.82%
Zero_Event: 854 events | Perfect_Event: 1077 events
Time_Diff: 11.47 ms | Time %: 19.21%
Event_Diff: 239 events | Event %: 22.18%
Sacc_Diff: 52.91 ms | Sacc %: 51.63%
Class 9 (1009 images)
Zero_Time: 22.55 ms | Zero_Prob: 71.26%
Perfect_Time: 30.42 ms | Perfect_Prob: 98.05%
Zero_Event: 117 events | Perfect_Event: 236 events
Time_Diff: 78.65 ms | Time %: 25.86%
Event_Diff: 119 events | Event %: 50.53%
Sacc_Diff: 79.99 ms | Sacc %: 78.01%
Table 28. Dataset recognition results summary.
Metric | MNIST (CeleX) | MNIST (DVS) | MNIST (FLASH) | MNIST (N) | MNIST avg | CIFAR 10 | N-CALTECH 101 | Total Average
Zero Time (ms) | 36.7 | 235.5 | 336.8 | 32.5 | 160.4 | 89.0 | 26.6 | 126.2
Zero Prob (%) | 56.2 | 61.7 | 66.8 | 67.1 | 63.0 | 45.0 | 20.5 | 52.9
Zero Event | 153,598 | 7616 | 3518 | 358 | 41,273 | 18,519 | 7701 | 31,885
Perfect Time (ms) | 51.5 | 306.1 | 335.9 | 39.65 | 183.3 | 123.4 | 24.1 | 146.8
Perfect Prob (%) | 78.7 | 87.2 | 96.0 | 96.4 | 89.6 | 58.5 | 45.5 | 77.1
Perfect Event | 172,121 | 7616 | 3842 | 516 | 46,024 | 23,507 | 8778 | 36,063
Time Diff (ms) | 14.8 | 108.7 | 5.8 | 7.9 | 34.3 | 82.1 | 12.6 | 38.7
Time Diff (%) | 28.8 | 35.5 | 1.7 | 19.9 | 21.5 | 66.5 | 52.2 | 34.1
Event Diff | 32,558 | 3400 | 324 | 167 | 9112 | 14,239 | 6074 | 9460
Event Diff (%) | 18.9 | 34.6 | 8.43 | 32.3 | 23.6 | 60.6 | 69.2 | 37.3
Saccade Diff (ms) | 613.7 | 2176.4 | 83.9 | 69.6 | 735.9 | - | 73.5 | 603.4
Saccade Diff (%) | 93.9 | 90.2 | 19.9 | 68.2 | 68.1 | - | 73.4 | 69.1
Table 29. Dataset partial pictures recognition results summary.
Metric | MNIST (CeleX) | MNIST (DVS) | MNIST (FLASH) | MNIST (N) | Average
Zero Time (ms) | 18.2 | 29.8 | 335.4 | 33.3 | 104.2
Zero Prob (%) | 58.5 | 62.9 | 73.5 | 71.5 | 66.6
Zero Event | 104,198 | 998 | 3460 | 383 | 27,260
Perfect Time (ms) | 28.6 | 43.8 | 338.4 | 43.3 | 113.5
Perfect Prob (%) | 81.3 | 79.5 | 97.3 | 96.4 | 88.6
Perfect Event | 133,067 | 1495 | 3820 | 592 | 34,743
Time Diff (ms) | 13.9 | 24.5 | 5.7 | 10.8 | 13.7
Time Diff (%) | 47.3 | 50.5 | 1.5 | 22.5 | 30.5
Event Diff | 46,368 | 865 | 391 | 218 | 11,960
Event Diff (%) | 35.9 | 51.4 | 9.3 | 36.3 | 33.2
Saccade Diff (ms) | 635.2 | 2366.5 | 85.6 | 69.1 | 789.1
Saccade Diff (%) | 97.2 | 98.8 | 20.4 | 67.5 | 71.0