1. Introduction
Generally, a web-based search system is designed to return accurate answers for given queries. Meanwhile, the types of search queries have diversified in web and mobile environments. In the past, only keywords were used as input queries, but input has since been extended to sentences, voice, images, and video [1,2,3,4]. In addition, by expanding the definition of semantic search, search services are developing into customized services that not only show objective results but also recommend information that users may like [5]. For example, they recommend and display customized image or video content using information such as user interests, search time, climate, and the place where the user resides.
To provide such functions, additional contextual information is essential. Researchers have actively worked on applying deep learning to extract contextual information from text, image, video, and voice data. In particular, contextual information based on image or video data has been successfully applied to computer vision tasks such as object detection, semantic segmentation, and image classification [6,7,8,9,10,11,12]. This information can describe object classes and background, as well as situations or relationships between objects [13,14,15].
There are several useful components in image and video processing. For a long time, texture and color have served as key low-level features for scene description [15]. Using this information, an image descriptor called Contextual Mean Census Transform (CMCT), which combines the distribution of local structures with contextual information, was designed to classify scenes. However, scene understanding also requires higher-level context, such as the relationships between people or their emotions in the scene. This means that traditional feature-based approaches cannot support high-level understanding. Therefore, more abstract high-level information such as when (time), where (location), what is being done (action), with whom (relationships with others), and so on should be extracted from image and video data [16,17,18].
To obtain such high-level abstract information, we focus on extracting time information from images. Time can serve as important metadata in search or recommendation systems. To extract it from image data, we pay attention to the amount of sunlight, which changes over the course of the day: dawn, morning, midday, evening, and night. In this respect, the sky region reflects the change of time in outdoor images well and can provide useful information about the environment, especially the given time. Therefore, by extracting the sky area and analyzing its characteristics, we can derive meaningful time-of-day information.
In this paper, we propose an efficient algorithm that classifies time of day by analyzing the sky region with deep learning and color histograms. The proposed algorithm uses Mask R-CNN [19], trained to extract the sky region from a given image. Based on the extracted regions, reference color histograms are generated, which can be considered the ground truth. Time is classified by comparing the reference color histograms with the histogram of the sky region extracted from a test image. We design windowed color histograms (for the RGB bands) to compare the input with the reference for each time period efficiently. In addition, we propose a weighted histogram representation that compares histograms by assigning larger weights to colors according to their importance.
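The comparison step described above can be sketched as follows. This is a minimal illustration, not the paper's exact formulation: the bin count, the histogram-intersection measure, and the per-class weight vectors are all illustrative assumptions.

```python
import numpy as np

def windowed_histogram(pixels, bins=32):
    """Per-channel (R, G, B) histograms over the sky pixels,
    concatenated and normalized so all bins sum to 1."""
    hists = []
    for c in range(3):
        h, _ = np.histogram(pixels[:, c], bins=bins, range=(0, 256))
        hists.append(h)
    h = np.concatenate(hists).astype(np.float64)
    return h / max(h.sum(), 1.0)

def weighted_similarity(hist, ref, weights):
    """Histogram-intersection similarity with per-bin weights that
    emphasize the colors most discriminative for a given time class."""
    return float(np.sum(weights * np.minimum(hist, ref)))

def classify_time(sky_pixels, references, weights):
    """Pick the time class whose weighted reference histogram
    best matches the histogram of the input sky pixels."""
    h = windowed_histogram(sky_pixels)
    scores = {t: weighted_similarity(h, ref, weights[t])
              for t, ref in references.items()}
    return max(scores, key=scores.get)
```

In practice the reference histograms would be averaged over the training images of each time class, and the weights chosen to emphasize each class's characteristic colors.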
The main contributions of this study are as follows:
We focus on the sky area of the image because it carries more information about light and more discriminative features;
The designed weighted histogram comparison can pay more attention to the important colors of each class;
As a result, we simplify the problem and achieve better time classification performance than existing deep learning models.
This study has a wide range of real-world applications. Using time information, a web-based search system can categorize results into dawn, dusk, day, or night. In addition, it can recommend content related to the time of a query. We aim to provide more accurate search and recommendation results by using the time information of images or video.
The remainder of this paper is organized as follows: In Section 2, we analyze existing methods as related work. Section 3 describes the proposed approach, and Section 4 presents the experimental results. Conclusions are presented in Section 5.
2. Related Works
Several studies on time classification from a single image have relied on computer vision algorithms. We categorize them into three types: traditional image processing techniques, deep learning-based models, and sky detection-based algorithms.
2.1. Traditional Image Processing Techniques
Traditional image processing techniques for time classification have mainly employed image histograms [20,21,22]. Saha et al. classified daytime and nighttime by thresholding an HSV histogram [20]. Their method determined a first parameter representing the amount of red or yellow pixels, and a second parameter representing light. It used brightness and color components, which are important for classifying time. Similarly, Taha et al. used an HSV histogram and discriminated daytime and nighttime scenes by thresholding the H-histogram and V-histogram [21]. In addition, Park et al. proposed a time classification method that used intensity, chromaticity in the RGB components, and k-means segmentation [22].
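A minimal sketch of the kind of brightness-threshold rule these methods use is shown below. The threshold values and the decision rule are illustrative assumptions, not the published methods themselves.

```python
import numpy as np

def classify_day_night(rgb_image, v_threshold=0.45, bright_frac=0.5):
    """Toy day/night rule in the spirit of HSV-thresholding methods:
    call the scene 'day' if enough pixels exceed a brightness threshold.
    Both thresholds are illustrative, not tuned values."""
    img = rgb_image.astype(np.float64) / 255.0
    # The V channel of HSV is the per-pixel maximum over R, G, B.
    v = img.max(axis=-1)
    return "day" if (v > v_threshold).mean() >= bright_frac else "night"
```

Such rules work because day and night differ sharply in brightness, which is also why they do not extend cleanly beyond two classes.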
However, these methods can only perform binary classification. Because the difference between day and night is clear, the two are easy to separate; it remains challenging to classify day, dawn (dusk), and night. We use an RGB color histogram that represents the differences between time periods and successfully classify them with high accuracy.
2.2. Deep Learning-Based Models
Deep learning-based image analysis has shown great progress in computer vision [23,24,25,26,27,28,29,30,31]. Deep models can recognize various features, low-level as well as high-level. Volokitin et al. and Ibrahim et al. used deep learning models to obtain more complex features and classified time into two or more classes [30,31].
Volokitin et al. made the first attempt at predicting time in outdoor scenes [30]. They employed pooling layers, which provide better features than fully connected layers for these tasks. They obtained good performance, especially at the month and season level; however, accuracy on the time of day was relatively low.
Ibrahim et al. proposed a model that extracts visual conditions such as dawn/dusk, day, and night for time detection. They trained a model based on residual learning using the ResNet50 architecture [31], which yielded somewhat low time classification accuracy. We also conducted an experiment employing the ResNet152 architecture, fine-tuned on our own dataset. The proposed algorithm achieves better performance than the ResNet used in [31].
2.3. Sky Detection-Based Algorithms
To estimate the time period of a day, the sky area contains meaningful information owing to changes in sunlight. Therefore, we regard sky region extraction as an important pre-processing module for time classification. Several works have addressed sky region detection in images [32,33,34,35,36].
Shen et al. used a histogram to extract the sky region [32]. However, this approach has several problems. First, detecting the sky region takes time. Second, it often fails to detect the exact sky region. Therefore, we employ Mask R-CNN, a well-known instance segmentation method, to resolve these problems [19]. Sharma et al. built their own dataset using metadata and classified time segments with deep learning when an input containing a sky region was given [37].
However, this approach also has the following problems: first, the metadata is not exact; second, there is ambiguity in the categories; third, for these two reasons, the accuracy of classifying time stamps was not good. Therefore, we use only the sky region of the image to classify time. We propose an efficient algorithm that classifies time by analyzing the color histogram of the sky region.
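Once a sky mask is available (from Mask R-CNN or any other segmenter), restricting the histogram to sky pixels is straightforward. The sketch below is illustrative: it assumes a boolean mask of the same height and width as the image, and the function names are our own.

```python
import numpy as np

def sky_pixels(image, mask):
    """Select only the pixels flagged as sky.
    image: (H, W, 3) uint8 RGB; mask: (H, W) boolean sky mask."""
    return image[mask]  # shape (N_sky, 3)

def sky_histogram(image, mask, bins=32):
    """Per-channel RGB histogram computed over sky pixels only, so
    buildings and foreground objects do not pollute the colors."""
    px = sky_pixels(image, mask)
    hists = [np.histogram(px[:, c], bins=bins, range=(0, 256))[0]
             for c in range(3)]
    h = np.concatenate(hists).astype(np.float64)
    return h / max(h.sum(), 1.0)
```

The resulting histogram is what gets compared against the per-class reference histograms, so everything outside the mask is ignored by design.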