1. Introduction
With the rapid development of Internet technology, people enjoy the convenience brought by the Internet, but this convenience also brings network security risks. In 1981, the first computer virus known to spread widely, ELK Cloner, was born. ELK Cloner was not destructive at the time; it was designed as a good-natured joke. Over time, however, the amount of malware has increased every year, and many variants have been created. At the same time, the harmfulness of malware is also increasing. Malware can be installed and run on computers or terminal devices without the consent of users, harming their legitimate rights and interests.
According to the China Internet Network Security Monitoring Data Analysis Report, in the first half of 2021 about 23.07 million malicious program samples were captured, with a daily average of more than 5.82 million transmissions, involving about 208,000 malicious program families; more than 866,000 new mobile Internet malicious programs were found through independent capture and vendor exchange. The number of hosts in China infected with computer malicious programs was about 4.46 million, up 46.8% year-on-year. Malware can invade or spread in many ways. Attackers can exploit vulnerabilities in web services, browsers, and operating systems, or use social engineering to trick end users into running malicious code. In addition, attackers use obfuscation technologies, such as code integration, dead-code insertion, subroutine reordering, instruction substitution, and code transposition, to circumvent detection by traditional defense means (such as firewalls, anti-virus software, and gateways) [
1]. This also hampers manual analysis to some extent. Therefore, how to detect and classify malware quickly and accurately is a research hotspot in the field of network security.
Malware detection technologies are generally divided into two categories: static [
2,
3,
4,
5] and dynamic [
6,
7,
8,
9] analysis methods. Static analysis methods do not require actually running the malware samples and focus on code analysis, for example analyzing the structure and syntax of malicious binary files and performing classification and detection according to family characteristics. Dynamic analysis methods can reveal the behavior of malware; they are not affected by obfuscation technology and can detect unknown malware samples. However, dynamic analysis is very time-consuming, and its operation steps are cumbersome. Static analysis avoids these problems, but it relies on efficient anti-virus engines and comprehensive virus databases, which results in lower detection accuracy.
In order to compensate for the shortcomings of static analysis methods, relevant personnel began to investigate the malware detection and classification methods based on visualization [
10,
11] and have achieved good performance. Despite the increasing number and variety of malware, the core binary or assembly code of malware in the same family shows similarities. After visualization, the essential texture and structural characteristics of the image are unchanged, which can effectively counter malware obfuscation. Machine-learning-based visualization techniques [
12,
13] are simple to operate, fast, and less dependent on data. However, they are sensitive to feature engineering and have difficulty dealing with large volumes of malware. Although large numbers of malware samples can be identified daily, it remains very difficult to detect and classify malware by relying solely on manual work.
In recent years, deep learning [
14,
15,
16,
17] has developed rapidly in malware detection, avoiding the complexity of manual feature extraction in traditional machine learning. At the same time, it reduces the impact of analysts' limited experience and ability on detection accuracy and can effectively resist malware attacks. Therefore, some researchers began to combine deep learning with visualization technologies [
18,
19,
20,
21,
22,
23] to classify and detect malware, providing a new approach. Currently, most malware visualization methods convert the information in binary files into grayscale images. Then, the similarity of texture or other features is observed to determine whether samples belong to the same family. Finally, deep learning methods are applied to improve detection accuracy. However, visualization techniques based on currently available deep neural networks still face many challenges. For example, the convolutional neural network models used are often too simple, with insufficient generalization ability, and cannot effectively extract the important information contained in malware. The extracted features are often single, focusing on only one aspect of the malware, which affects the accuracy of classification and detection. Moreover, the generated images generally need to be of uniform size, which easily causes information loss during processing and introduces obvious noise.
To solve the above problems, this paper proposes a malware detection and classification method based on deep learning of multi-channel RGB images. The method transforms malware into three-channel images, combines a deep convolutional neural network model with transfer learning, and integrates bilinear interpolation, multi-channel feature extraction, data enhancement, and other technologies, improving the efficiency and accuracy of classification and detection while effectively retaining malware sample information. The main contributions of this paper are summarized as follows:
A new method of malware visualization is proposed, which transforms the extracted malware binary file information into three different grayscale images and fuses them into three-channel RGB images. This allows malware to be analyzed from a multi-dimensional perspective and effectively retains the relevant information in the malware binary file.
We propose a new framework for malware classification using convolutional neural networks. We combined the ResNet34 convolutional neural network model with the model trained on the ImageNet dataset for transfer learning and compared the performance with other ResNet network models. This method does not require reverse analysis and can achieve a good training effect in a short time, effectively improve the accuracy of malware classification, and enhance the model’s generalization ability.
Instead of cropping or zero-filling the sample images, this paper used image interpolation algorithms to adjust the size of the grayscale images, and compared and analyzed the performance of different interpolation algorithms, effectively avoiding the loss of feature information.
The RGB images generated were processed by using the contrast limited adaptive histogram equalization (CLAHE) data enhancement method, which can better deal with the problem of data imbalance. At the same time, it can effectively limit noise amplification, expand the local contrast, and display more details of the smooth areas.
The rest of the study is organized as follows:
Section 2 introduces the traditional malware detection methods and related research on using visualization techniques to classify malware.
Section 3 introduces in detail the visual characteristics of multi-channel RGB images based on transfer learning and the new framework of the malware classification method based on the ResNet convolutional neural network.
Section 4 gives the experimental results and makes a comparative analysis of the experimental data. Finally,
Section 5 summarizes the study and provides some ideas and suggestions.
3. Methodology
To improve the accuracy of malware classification, we propose a malware classification method using multi-channel image visual characteristics and a convolutional neural network, which is based on transfer learning. It includes three components: feature extraction, the generation of multi-channel images, and the construction of convolutional neural networks for classification and detection. The framework of our method is shown in
Figure 1. Firstly, the malware binary files are converted into three different grayscale images, which are scaled to a uniform size using the bilinear interpolation algorithm. Then, they are fused into three-channel RGB images as feature images, and the CLAHE algorithm is used to enhance the synthesized feature images. Finally, the transfer learning method is combined with the improved convolutional neural network to classify and detect the malware samples.
3.1. Image Representation of Malware
Generally, the color depth of the points in a black-and-white image is called grayscale, and the range is 0–255 (0 represents black; 255 represents white). Therefore, black and white images are also grayscale images. To generate feature images, it is necessary to convert the information in the malware file. In this paper, we propose three conversion methods to generate grayscale images. ASCII images are generated using asm files, and hexadecimal and entropy images use byte files. This is demonstrated using four malware samples, generating different grayscale images, as shown in
Figure 2, where the images at the same position in (a), (b), and (c) correspond one-to-one; they are generated from the same malware sample using the three different methods.
3.1.1. ASCII Images
Each original byte in the malware binary file can be converted into an 8-bit gray pixel, and the byte sequence is segmented at a fixed width into a two-dimensional gray matrix. Firstly, we converted the byte sequence into consecutive ASCII visible characters. Then, the width of the generated image was fixed at 256 pixels, so the first 256 bytes of the binary file became the first row of pixels, bytes 257 to 512 the second row, and so on. When the last row contained fewer than 256 bytes, it was padded with 0x00. Finally, the height of the generated ASCII image depends on the size of the malware sample file.
Figure 2a shows the generated malware ASCII image.
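The fixed-width conversion described above can be sketched as follows; `bytes_to_grayscale` is a hypothetical helper name, and NumPy is assumed for the matrix representation:

```python
import numpy as np

def bytes_to_grayscale(data: bytes, width: int = 256) -> np.ndarray:
    """Turn a raw byte sequence into a 2-D grayscale matrix: each byte
    becomes one 8-bit pixel, rows are `width` pixels wide, and the final
    partial row is padded with 0x00, as described above."""
    pad = (-len(data)) % width          # bytes needed to complete the last row
    padded = data + b"\x00" * pad
    return np.frombuffer(padded, dtype=np.uint8).reshape(-1, width)
```

The image height then follows directly from the file size, matching the description above.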
3.1.2. Hexadecimal Images
We converted every 8 bits (one byte) of the hexadecimal representation of the malware binary file into a grayscale pixel value (in the range 0–255), and the output image was saved in PNG format. In this paper, different size intervals were defined according to the sample sizes of the dataset used, and the results are shown in
Table 1. For example, when the size of the malware sample file was less than 10 KB, the image's width was set to 64 pixels, and the height was determined by the sample file size. When the last row was shorter than the set width, it was padded with 0x00. The generated hexadecimal image of the malware is shown in
Figure 2b.
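A minimal sketch of this size-dependent conversion follows. Only the first threshold (files under 10 KB mapped to a 64-pixel width) is stated in the text; the remaining intervals are illustrative placeholders for Table 1, and both function names are assumptions:

```python
import numpy as np

def choose_width(file_size: int) -> int:
    """Pick an image width from the file size. Only the first interval
    (<10 KB -> 64 px) comes from the text; the others are assumed."""
    if file_size < 10 * 1024:
        return 64
    if file_size < 100 * 1024:
        return 128   # assumed interval
    return 256       # assumed interval

def hex_to_grayscale(data: bytes) -> np.ndarray:
    """Map each byte (8 bits) to a gray value in 0-255, pad the final
    row with 0x00, and reshape at the size-dependent width."""
    width = choose_width(len(data))
    pad = (-len(data)) % width
    return np.frombuffer(data + b"\x00" * pad, dtype=np.uint8).reshape(-1, width)
```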
3.1.3. Entropy Images
Entropy images are generated by calculating the entropy of the malware sample file; the entropy represents the degree of disorder of the byte values in the sample file and can reflect the similarity between sample files. By observing the generated pixel values, we found that the entropy values varied over a small range, which is why we magnified them. We followed the way Fu et al. [43] generated entropy images. The entropy is calculated as shown in Equation (1):

$H = -\sum_{i=0}^{255} p(i) \log_2 p(i)$   (1)

where $p(i)$ represents the probability of the occurrence of byte value $i$. When all byte values in the intercepted byte sequence are the same, the entropy is 0; when all 256 byte values occur with equal probability, the entropy reaches its maximum of 8. To facilitate the observation of the generated entropy images, we scaled the entropy values from the range 0–8 to 0–255.
Figure 2c shows the generated malware entropy image.
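Equation (1) applied per byte window can be sketched as below. The 256-byte window size and the `entropy_image` name are assumptions for illustration; the paper follows Fu et al. [43] for the actual conversion:

```python
import math
import numpy as np
from collections import Counter

def entropy_image(data: bytes, width: int = 256, window: int = 256) -> np.ndarray:
    """Compute the Shannon entropy (Equation (1)) of fixed-size byte
    windows and scale the 0-8 range to 0-255 gray values."""
    pixels = []
    for start in range(0, len(data), window):
        chunk = data[start:start + window]
        n = len(chunk)
        h = -sum((c / n) * math.log2(c / n) for c in Counter(chunk).values())
        pixels.append(int(h / 8 * 255))       # magnify entropy for visibility
    pixels += [0] * ((-len(pixels)) % width)  # pad the final row
    return np.array(pixels, dtype=np.uint8).reshape(-1, width)
```

An all-identical window maps to black (entropy 0), while a window containing every byte value equally maps to white (entropy 8), consistent with the scaling described above.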
3.2. Interpolation Algorithms
When malware binary files are converted to grayscale images, the sizes of the generated images are not uniform. Therefore, we needed to standardize the generated grayscale images. To avoid information loss and retain the features of the sample file to the greatest extent, we used interpolation algorithms to scale the grayscale images. In the following, we briefly introduce the nearest neighbor, bilinear, and bicubic interpolation algorithms. In Section 4, we compare and analyze the three interpolation algorithms through experiments and select the best one to process the ASCII, hexadecimal, and entropy images so that they are unified to a size of 256 × 256. The grayscale images processed by the different interpolation algorithms are shown in
Figure 3.
3.2.1. Nearest Neighbor Interpolation Algorithm
The nearest neighbor interpolation algorithm is simple and requires little computation. The gray value of the original pixel closest to the location of the sampling point is taken as the gray value of the sampling point. The generated image is shown in
Figure 3a. Let $(x, y)$ be the sampling point to be measured and $(x_0, y_0)$, $(x_0, y_1)$, $(x_1, y_0)$, and $(x_1, y_1)$ be the pixel coordinates of the original image. If $(x_0, y_0)$ is the point closest to the sampling point, the interpolation result is $f(x_0, y_0)$. Although the nearest neighbor interpolation algorithm is very fast, its quality is poor, and it produces obvious jagged and mosaic artifacts.
3.2.2. Bilinear Interpolation Algorithm
The bilinear interpolation algorithm overcomes the defects of the nearest neighbor algorithm; the effect is better, but it runs slightly slower. Given the four adjacent pixels $Q_{11} = (x_1, y_1)$, $Q_{12} = (x_1, y_2)$, $Q_{21} = (x_2, y_1)$, and $Q_{22} = (x_2, y_2)$, linear interpolation is first performed in the $x$ direction and then in the $y$ direction to obtain the value at the sampling point $(x, y)$. Its core idea is linear interpolation in two directions, and the generated image is shown in
Figure 3b. The linear interpolation formulas in the $x$-direction are:

$f(x, y_1) \approx \frac{x_2 - x}{x_2 - x_1} f(Q_{11}) + \frac{x - x_1}{x_2 - x_1} f(Q_{21})$

$f(x, y_2) \approx \frac{x_2 - x}{x_2 - x_1} f(Q_{12}) + \frac{x - x_1}{x_2 - x_1} f(Q_{22})$

The linear interpolation formula in the $y$-direction is:

$f(x, y) \approx \frac{y_2 - y}{y_2 - y_1} f(x, y_1) + \frac{y - y_1}{y_2 - y_1} f(x, y_2)$

Combining the two steps gives the bilinear interpolation result $f(x, y)$.
3.2.3. Bicubic Interpolation Algorithm
Bicubic interpolation, also known as cubic convolution interpolation, is the most complex of the three algorithms, with a large amount of computation and a long running time. However, it produces smoother edges and achieves good results. The algorithm uses the gray values of the sixteen points around the sampling point to perform cubic interpolation, considering both the gray values of the adjacent points and the rate of change of the gray values. The generated image is shown in
Figure 3c, and the formula for calculating the pixel value $f(x, y)$ is as follows:

$f(x, y) = \sum_{i=0}^{3} \sum_{j=0}^{3} f(x_i, y_j)\, W(x - x_i)\, W(y - y_j)$

where $W(\cdot)$ is used to calculate each pixel point's weight, and the weight's value is related to the distance between the pixel points.
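As an illustration of the scaling step, here is a minimal NumPy sketch of bilinear interpolation (the algorithm of Section 3.2.2), assuming 2-D uint8 grayscale arrays; the function name is hypothetical:

```python
import numpy as np

def bilinear_resize(img: np.ndarray, out_h: int, out_w: int) -> np.ndarray:
    """Resize a 2-D grayscale array with bilinear interpolation:
    interpolate along x on the two bracketing rows, then along y."""
    in_h, in_w = img.shape
    # Map each output pixel back into source coordinates
    ys = np.linspace(0, in_h - 1, out_h)
    xs = np.linspace(0, in_w - 1, out_w)
    y0 = np.floor(ys).astype(int); y1 = np.minimum(y0 + 1, in_h - 1)
    x0 = np.floor(xs).astype(int); x1 = np.minimum(x0 + 1, in_w - 1)
    wy = (ys - y0)[:, None]          # fractional offsets, column vector
    wx = (xs - x0)[None, :]
    img = img.astype(np.float64)
    top = img[y0][:, x0] * (1 - wx) + img[y0][:, x1] * wx
    bot = img[y1][:, x0] * (1 - wx) + img[y1][:, x1] * wx
    return (top * (1 - wy) + bot * wy).astype(np.uint8)
```

In practice a library routine would be used, but the sketch shows why bilinear scaling blends neighboring gray values instead of dropping them, which is the property that preserves the texture features of the sample file.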
3.3. Contrast Limited Adaptive Histogram Equalization
Using the bilinear interpolation algorithm, the grayscale images generated by the three different methods were scaled to a uniform size of 256 × 256. In the next step, we performed the fusion operation: the ASCII images were used as the first channel, the entropy images as the second channel, and the hexadecimal images as the third channel to form three-channel RGB images. Then, the CLAHE algorithm was used to enhance the generated RGB images. First, the image is partitioned into sub-regions of equal size, and the histogram of each block is calculated. Then, a threshold $T$ is defined; if the histogram of a block exceeds the threshold, the histogram is clipped at the top, and the portion exceeding the threshold is evenly redistributed over all gray levels until equalization is achieved. Finally, the center point of each sub-region is used as a reference point to obtain the gray value, and an interpolation algorithm is applied to obtain the enhanced image. The CLAHE algorithm is an improvement of the adaptive histogram equalization algorithm, which can enhance the image's contrast while suppressing noise. The three-channel RGB images processed using the CLAHE algorithm are shown in
Figure 4, and the formula for the threshold $T$ is shown below:

$T = \frac{P}{Q}\left(1 + \frac{\alpha}{100}\,(s_{\max} - 1)\right)$

where $P$ denotes the number of pixels in a block, $Q$ is the number of gray levels, $\alpha$ represents the truncation factor, ranging from 0 to 100, and $s_{\max}$ represents the maximum allowed slope, which determines the amplitude of contrast enhancement.
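The channel fusion step, together with the clip-and-redistribute idea behind CLAHE, can be sketched as follows. This is a simplified, global version: CLAHE proper clips per-tile histograms and interpolates between tiles, and the `clip_frac` parameter is an assumed stand-in for the threshold $T$:

```python
import numpy as np

def fuse_channels(ascii_img, entropy_img, hex_img):
    """Stack the three 256 x 256 grayscale images into one RGB image,
    in the channel order used above: ASCII, entropy, hexadecimal."""
    return np.stack([ascii_img, entropy_img, hex_img], axis=-1)

def clipped_equalize(channel, clip_frac=0.03):
    """Simplified contrast-limited equalization of one channel: clip the
    histogram at a limit, redistribute the excess evenly over all gray
    levels, and map gray values through the resulting CDF."""
    hist, _ = np.histogram(channel, bins=256, range=(0, 256))
    limit = max(1, int(clip_frac * channel.size))
    excess = int(np.sum(np.maximum(hist - limit, 0)))
    hist = np.minimum(hist, limit) + excess // 256  # redistribute clipped mass
    cdf = np.cumsum(hist).astype(np.float64)
    cdf = (cdf - cdf.min()) / max(cdf.max() - cdf.min(), 1) * 255
    return cdf[channel].astype(np.uint8)
```

Clipping the histogram before building the CDF is what limits noise amplification in near-uniform regions while still expanding local contrast, as described above.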
3.4. Transfer Learning
Training a new convolutional neural network may require iterating over dozens of epochs to achieve good training accuracy. Moreover, if the convolutional neural network is very complex, the number of parameters very large, and the dataset very small, the data will not be sufficient to train the whole network; overfitting can occur and affect the accuracy of classification and detection. Therefore, we used transfer learning to address these problems. Transfer learning can reach good results in a short time and is well suited to small datasets. We applied the parameters of a model pre-trained on the ImageNet dataset to our convolutional neural network model. This is equivalent to taking a model developed for Task A as the starting point and reusing its parameters in the model developed for Task B.
At present, there are three common transfer learning methods: training all parameters after loading the weights; training only the last several layers after loading the weights; and adding a fully connected layer to the original network model after loading the weights and training only that layer. This paper adopted the first method, training all parameters after loading the weights, whose classification and detection results were better than those of the other two. First, all the pre-trained model parameters were loaded into the ResNet34 convolutional neural network, and the number of nodes in the last fully connected layer was modified to match the number of classes. Then, the parameters of all layers were trained on the dataset used.
3.5. Convolution Neural Network Classification
In theory, the deeper the neural network, the more complex the model structure, the more comprehensive the extracted features, and the better the effect. In practice, however, this does not hold: the problems of vanishing or exploding gradients become more serious when a convolutional neural network is stacked to a certain depth. Therefore, we used the ResNet convolutional neural network to classify malware images, preventing these problems and obtaining better classification and detection results. ResNet was first proposed by Microsoft Research and won the ImageNet competition in 2015. It introduces the residual structure, enabling ultra-deep networks (over 1000 layers), and uses batch normalization to speed up training (discarding dropout). Compared with other convolutional neural networks, ResNet can greatly improve the system's robustness and significantly improve the effect of classification and detection.
Figure 5 shows the proposed convolutional neural network architecture.
We took a multi-channel RGB image of size 256 × 256 as the input, and the improved ResNet34 convolutional neural network model was used for classification and detection. First, the image passes through a convolutional layer with a 7 × 7 kernel and then through four convolution groups comprising 16 residual structures (3 dashed-line residual structures and 13 solid-line residual structures). The downsampling of the first convolution group is realized by a 3 × 3 max pooling layer; the downsampling of each remaining convolution group is realized by the residual block adjacent to the previous group. Finally, the output passes through an average pooling layer and a fully connected layer, and the softmax function converts the output into a probability distribution. In the training phase, the model was trained with the cross-entropy loss function, and the parameters were learned with the Adam optimizer. The formula of the cross-entropy loss function is as follows:
$L = -\sum_{i=1}^{N} y_i \log(\hat{y}_i)$

where $N$ is the number of categories, $y_i$ represents the true distribution of the samples, and $\hat{y}_i$ denotes the distribution predicted by the model.
Compared with the original ResNet34 model, we set the strides of the first and second convolutional layers of the dashed-line residual structure to 1 and 2, respectively, used the GELU activation function, and applied the parameters trained on ImageNet for transfer learning. This accelerates training and enhances the model's generalization ability; at the same time, the classification speed and accuracy are improved.
5. Conclusions and Prospects
This paper presented a new malware detection method. First, three different methods were used to generate grayscale images, and an interpolation algorithm was used to unify the image sizes. The generated grayscale images were then fused into multi-channel images. Next, we combined the ResNet34 convolutional neural network with parameters pre-trained on the ImageNet dataset. Finally, the improved model was used to classify the images. This method requires neither feature engineering nor reverse analysis and can directly extract features for training. The model uses a data enhancement method to alleviate the overfitting caused by sample imbalance.
To better evaluate the performance of the proposed method, the existing ResNet series of networks was extensively evaluated, and grayscale and color images were also compared and analyzed. The experimental results showed that the proposed malware detection method achieved 99.99% classification accuracy and a 99.35% F-score. The designed model has strong generalization and excellent classification ability, effectively improving malware classification accuracy.
In future work, we will classify and detect malware in real-world settings and study a larger, more diverse collection of malware family samples. To make the results more convincing and avoid random errors, the next experiments will use cross-validation to evaluate the average results quantitatively. Processing the size of grayscale images can easily cause information loss; in the future, we hope to be able to input images of any size into the model.