This section describes the introduced datasets and the prediction methods in detail.
2.1. Datasets
The datasets used in this study were extracted from two distinct platforms, Autovit.ro and Mobile.de, which are prominent websites dedicated to selling second-hand cars in Romania and Germany, respectively. Autovit.ro primarily caters to the Romanian market, with a focus limited to the country’s geographical boundaries. In contrast, Mobile.de is a broader platform that serves the German second-hand car market and is known for having the largest market share within the European Union. It is also widely used by individuals in neighboring countries, including Romania. Overall, 30,264 car ads were scraped from Autovit.ro and 1,308,575 entries were extracted from Mobile.de in March 2023. As per the terms and conditions of each website, posting duplicate ads of the same vehicle is not allowed, and mechanisms are in place to remove ads violating this rule; this implies that no duplicate vehicles were part of our dataset. While the same features were extracted from both websites, variations in the feature values were observed, requiring subsequent post-processing steps for normalization. The features considered relevant for this investigation were the car brand, car model, year of manufacture, mileage, engine power, gearbox type, fuel type, engine capacity, transmission, car shape, color, add-ons, images, and price.
An outlier filtering technique was employed to ensure the integrity of the data and eliminate spurious ads that could adversely impact the training and prediction processes. This filtering procedure was conducted alongside additional pre-processing steps to maintain data quality.
The subsequent sections detail the pre-processing steps undertaken for both datasets unless otherwise specified.
Mobile.de ads do not explicitly contain the categorized car brand and model but rather a title written by the seller. We extracted these two relevant features from the title using a greedy approach of matching them against an exhaustive list of all car brands and models and choosing the closest fit. Finding a category was impossible for some ads, and these entries were dropped from the dataset.
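As a minimal sketch, this greedy title-matching step could look as follows; the `KNOWN_MODELS` excerpt, the similarity cutoff, and the use of `difflib`-based fuzzy matching are illustrative assumptions, not the exact implementation used:

```python
import difflib

# Hypothetical excerpt of the exhaustive brand/model list (illustrative only).
KNOWN_MODELS = {
    "volkswagen": ["golf", "passat", "tiguan"],
    "bmw": ["320d", "x5", "118i"],
    "dacia": ["logan", "duster", "sandero"],
}

def extract_brand_model(title):
    """Greedily match a free-text ad title against known brands and models.

    Returns (brand, model) when a sufficiently close brand is found; ads
    for which no category can be found are dropped from the dataset.
    """
    tokens = title.lower().split()
    for token in tokens:
        brand_hits = difflib.get_close_matches(token, list(KNOWN_MODELS), n=1, cutoff=0.8)
        if not brand_hits:
            continue
        brand = brand_hits[0]
        # Pick the closest model of that brand among the remaining tokens.
        for other in tokens:
            model_hits = difflib.get_close_matches(other, KNOWN_MODELS[brand], n=1, cutoff=0.8)
            if model_hits:
                return brand, model_hits[0]
        return brand, None
    return None

print(extract_brand_model("Volkswagen Golf 1.6 TDI, 2017"))  # ('volkswagen', 'golf')
```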
We discarded the ads that did not contain the relevant features mentioned above, as well as those without at least one image of the car’s exterior. As some sellers published multiple images, some irrelevant to the ad or not showing the entire vehicle, we only considered images that contained the full car exterior. This filtering was carried out with the help of a YOLOv7 model [30] that detected a bounding box for each car in an image. Images displaying multiple cars without a prominent focus (e.g., parking lots) or with car bounding boxes occupying less than 75% of the entire image size were removed from the dataset.
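The coverage criterion applied to the detections can be sketched as follows; the function name and the exactly-one-prominent-car rule are illustrative assumptions layered on the criteria stated above (the actual boxes come from the YOLOv7 model):

```python
def keep_image(boxes, image_w, image_h, min_ratio=0.75):
    """Decide whether an ad image passes the car-coverage filter.

    `boxes` holds (x1, y1, x2, y2) car detections, as a detector such as
    YOLOv7 would return them. The image is kept only when a single car is
    prominent, i.e., exactly one detected box covers at least `min_ratio`
    of the image area.
    """
    image_area = image_w * image_h
    prominent = [
        (x2 - x1) * (y2 - y1) / image_area >= min_ratio
        for (x1, y1, x2, y2) in boxes
    ]
    # Multiple cars with no prominent focus (e.g., parking lots) are rejected,
    # as are images where the car occupies less than 75% of the frame.
    return sum(prominent) == 1

# A full-frame car passes; two distant cars in a parking-lot shot do not.
print(keep_image([(0, 0, 1920, 1080)], 1920, 1080))                       # True
print(keep_image([(0, 0, 300, 200), (500, 400, 800, 600)], 1920, 1080))   # False
```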
To maintain precision and minimize the presence of erroneous data, listings with questionable features were eliminated, as they could potentially contain inaccurate information. Thus, we excluded cars with a manufacturing year before 2000, a mileage exceeding 450,000 km, a price surpassing EUR 100,000, and an engine power exceeding 600 horsepower.
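A minimal sketch of this plausibility filter, assuming the listings are held in a pandas DataFrame; the column names are hypothetical:

```python
import pandas as pd

def filter_questionable(ads):
    """Keep only listings within the plausibility thresholds from the text."""
    mask = (
        (ads["year"] >= 2000)
        & (ads["mileage_km"] <= 450_000)
        & (ads["price_eur"] <= 100_000)
        & (ads["power_hp"] <= 600)
    )
    return ads[mask]

ads = pd.DataFrame({
    "year": [2017, 1998, 2021],
    "mileage_km": [120_000, 90_000, 500_000],
    "price_eur": [9_500, 3_000, 25_000],
    "power_hp": [110, 75, 190],
})
kept = filter_questionable(ads)
print(len(kept))  # 1  (row 1 fails the year check, row 2 the mileage check)
```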
The dataset was split randomly into an 80% training set and a 20% validation set; however, we ensured a balanced distribution of car brands in each subset. Notably, a car advertisement could include multiple images, resulting in multiple entries within the dataset (one for each image). However, measures were taken to ensure that the training and validation sets did not contain the same advertisements but rather different ads, each with their respective images.
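The ad-level split can be sketched as follows; the per-brand shuffle-and-slice strategy and the function name are illustrative assumptions. Splitting ad IDs rather than images guarantees that the multiple images of one ad never appear on both sides of the split:

```python
import pandas as pd

def split_ads(ads, val_frac=0.2, seed=0):
    """80/20 split at the advertisement level, balanced by car brand."""
    train_ids, val_ids = [], []
    for _, group in ads.groupby("brand"):
        # Shuffle each brand's ads, then carve off the validation fraction.
        ids = group["ad_id"].sample(frac=1.0, random_state=seed).tolist()
        n_val = int(round(len(ids) * val_frac))
        val_ids += ids[:n_val]
        train_ids += ids[n_val:]
    return set(train_ids), set(val_ids)

ads = pd.DataFrame({"ad_id": range(20), "brand": ["audi"] * 10 + ["bmw"] * 10})
train, val = split_ads(ads)
print(len(train), len(val))  # 16 4
```

All images belonging to an ad ID would then follow that ID into its partition.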
Within the training dataset, we calculated each car model’s mean price and standard deviation. Subsequently, we removed outlier listings from the entire dataset that fell outside the range $[\mu - 2\sigma,\ \mu + 2\sigma]$, where $\mu$ and $\sigma$ are the mean price and standard deviation calculated for each car model, respectively. For car models with fewer than 20 instances in the dataset, we calculated the mean and standard deviation for the car manufacturer instead to ensure meaningful measurements, because the car manufacturer category contained an adequate number of entries for each group; frequently represented car models (i.e., with over 20 sale ads) kept their per-model statistics.
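A possible pandas sketch of this outlier-removal step; the interval width `k` and the column names are assumptions for illustration, and the statistics are computed and applied on one frame here for brevity, whereas the text computes them on the training set:

```python
import pandas as pd

def remove_price_outliers(ads, k=2.0, min_count=20):
    """Drop listings priced outside [mu - k*sigma, mu + k*sigma].

    Statistics are computed per car model; models with fewer than
    `min_count` listings fall back to per-brand statistics, as those
    groups contain an adequate number of entries.
    """
    counts = ads.groupby("model")["price"].transform("count")
    group = ads["model"].where(counts >= min_count, ads["brand"])
    mu = ads.groupby(group)["price"].transform("mean")
    sigma = ads.groupby(group)["price"].transform("std")
    return ads[(ads["price"] - mu).abs() <= k * sigma]

ads = pd.DataFrame({
    "brand": ["vw"] * 8,
    "model": ["golf"] * 6 + ["scirocco"] * 2,
    "price": [9000, 9200, 9500, 9800, 10000, 40000, 9800, 10200],
})
kept = remove_price_outliers(ads, k=2.0, min_count=3)
print(len(kept))  # 7 -- the 40,000 golf is dropped as an outlier
```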
This filtering removed 15,253 entries from Autovit.ro and 1,001,324 entries from Mobile.de; thus, our datasets retained 15,011 and 307,251 unique entries, respectively. Moreover, 59,450 images were available for Autovit.ro ads, while 1,628,546 images from Mobile.de were kept.
A thorough analysis of both datasets is presented below. We based our choice of experiments on this analysis to make the most of the data.
The car brand and model were the most relevant categorical features for our task.
Figure 1 and Figure 2 depict the frequency distributions of the top 10 most popular car brands in the datasets. Notably, a similarity emerges from the figures, as they reveal a considerable overlap in the most frequently occurring models between the German and Romanian markets. This alignment could be attributed to the substantial influx of car imports into Romania, particularly in the form of second-hand vehicles originating from Germany.
The following features had similar distributions and tended to follow the same patterns from one dataset to the other, even though their values were not considered when creating the training and validation partitions. The year of manufacture data in Figure 3 and Figure 4 show that the majority of vehicles were manufactured between 2015 and 2020, with an age of around five years at the time of the ad posting.
In terms of mileage, roughly 5% of the cars (i.e., 728 from the Romanian dataset and 21,171 from the German ads) had less than 5000 km, making them candidates for new or almost new vehicles. Here, the difference between the two datasets was more striking, as Mobile.de ads tended to become less frequent as the mileage increased. At the same time, a high number of vehicles from the Romanian market were sold at around 200,000 km, a tendency also shown in Figure 5 and Figure 6.
Engine power was measured in horsepower, and the vast majority of advertised cars had a value between 100 and 200 HP. Ads outside this range were less frequent in both datasets, especially above the 300 HP mark. However, Mobile.de also advertises luxury cars with high engine power, given its wider market (see Figure 7 and Figure 8).
Table 1 highlights the distribution of a subset of features across both datasets and partitions. A difference was observed in the gearbox category between the two datasets. The distribution on Autovit.ro highlighted a strong preference for automatic cars, with around 50% more automatic vehicles than manual ones; the gearbox type tended to have a meaningful impact on the selling price. In the German market, however, the difference was small, and the numbers were balanced between the two classes. The fuel type highlighted another striking difference between the two datasets. Diesel cars were advertised on Autovit.ro more than gasoline-based ones by a large margin. Although this preference could also be observed on Mobile.de, the vehicles there were more evenly balanced. In both datasets, however, petroleum-based fuels were strongly preferred over alternatives. The engine capacity was a relevant feature, since it influences the car tax. It had a similar distribution in both datasets, with most vehicles advertised at around 2000 cm³. In terms of transmission type, a preference for 2 × 4 transmission was observed in both datasets, as an integral transmission increased the price; this difference was more pronounced in the Mobile.de dataset. However, it should be noted that while sellers on Autovit.ro were asked to choose the transmission type, users on the German website had the possibility of adding the 4 × 4 feature as an add-on, and many may have omitted this step.
The car shape also had a different distribution between the two datasets, as SUVs are prevalent in the Romanian market, while the Mobile.de website presented a more balanced distribution (see Figure 9 and Figure 10).
The color did not have a high impact on the price, although specific car models had a default color with a lower price than other options (see Figure 11 and Figure 12).
In addition to the primary features, vehicle owners could append a list of supplementary attributes for their cars in the advertisement. These add-ons were presented as an unordered list with string-based categorical values. The add-ons for a given vehicle could span up to 180 distinct categories for Autovit.ro and 120 for Mobile.de, with the most prevalent ones displayed in Table 2.
Finally, the predicted feature was the price. Both datasets showcased similar distributions (see Figure 13 and Figure 14), with the highest number of cars advertised at below EUR 20,000. It should be noted that the price represented the owner’s asking price, which may not have reflected the real market value of the car. Furthermore, the prices categorized by the most popular manufacturers displayed noteworthy variations, with certain brands exhibiting a broad spectrum of potential prices while others had values concentrated within a narrow range, as illustrated in Figure 15. It should also be noted that cars advertised on Autovit.ro had a higher mean asking price than cars of the same brand on Mobile.de. An underlying reason is that some of the cars on the Romanian market were bought from Germany and resold at a higher price in Romania.
Overall, we observed that the Mobile.de dataset had more evenly distributed numerical features and a more granular and diverse range of categorical features. In contrast, the Romanian market tended to be more biased towards certain car types.
The initial representation of the features remained largely consistent across all experiments, with only minor variations based on their respective types. Table 3 provides an overview of the features in their original state, while the subsequent model descriptions document any specific modifications made to them.
2.3. Neural Network Methods
We conducted various experiments involving different neural network architectures to enhance the learning of inter-feature relationships and optimize the representation of the car model and its manufacturer. All neural network architectures were trained to learn an embedding that optimally represented the car model and its brand. However, the training process failed to converge for certain infrequent car models. To address this issue, we employed a mapping procedure whereby the embedding for models with fewer than 20 occurrences was learned for the manufacturer rather than for the model itself. To encode the name of the car more formally, we considered the following method:

$$\text{name}(c) = \begin{cases} \text{model}(c), & \text{if } \text{model}(c) \text{ has at least 20 listings} \\ \text{brand}(c), & \text{otherwise} \end{cases} \tag{1}$$
Furthermore, the price was scaled based on the mean and standard deviation computed per car model (or brand, for models with a frequency of fewer than 20 entries). As such, the predicted price was computed as:

$$\hat{p} = \frac{p - \mu_{\text{name}(c)}}{\sigma_{\text{name}(c)}} \tag{2}$$

where $p$ is the listed price, and $\mu_{\text{name}(c)}$ and $\sigma_{\text{name}(c)}$ are the mean price and standard deviation of the corresponding car name group.
After the model performed the predictions, the final price was evaluated with the inverse transformation of Equation (2).
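The scaling and its inverse amount to a standard z-score transformation, sketched below with hypothetical per-model statistics:

```python
def scale_price(price, mu, sigma):
    """Standardize the asking price with per-model (or per-brand) statistics."""
    return (price - mu) / sigma

def unscale_price(scaled, mu, sigma):
    """Inverse transformation recovering the final predicted price."""
    return scaled * sigma + mu

mu, sigma = 15_000.0, 2_500.0          # hypothetical per-model statistics
scaled = scale_price(18_500.0, mu, sigma)
print(round(unscale_price(scaled, mu, sigma), 6))  # 18500.0
```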
An embedding for the rest of the categorical features was learned during the training process. Numerical features were scaled as described in Table 3. In terms of add-ons, we experimented with several ways of including them in the training; the subsections below offer a detailed description of how add-ons were encoded and used as features.
A general architecture (see Figure 16) was used in all the experiments. Unless a variation is mentioned regarding image processing or add-on representation, the neural network had the same general structure with different hyperparameters fine-tuned for each special case. The numerical features were used as-is, while an embedding layer trained its weights to learn an optimal representation for each categorical feature in the current context. The network component responsible for handling add-ons differed from one approach to another and is detailed for each variation. All these representations were concatenated and served as the input for a stack of dense layers of different sizes to learn and represent the complex relationships between the input features and their corresponding outputs. The final dense layer computed the model output and predicted the price value.
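A minimal PyTorch sketch of this general architecture; layer sizes, the embedding dimension, and the feature cardinalities are illustrative assumptions, and the add-on component (which varies per experiment) is omitted:

```python
import torch
import torch.nn as nn

class PriceRegressor(nn.Module):
    """Per-feature embeddings for categorical inputs, raw numerical inputs,
    concatenation, and a stack of dense layers ending in one price output."""

    def __init__(self, cardinalities, n_numeric, emb_dim=16, hidden=(256, 128, 64)):
        super().__init__()
        self.embeddings = nn.ModuleList(
            nn.Embedding(card, emb_dim) for card in cardinalities
        )
        layers, width = [], len(cardinalities) * emb_dim + n_numeric
        for h in hidden:
            layers += [nn.Linear(width, h), nn.ReLU()]
            width = h
        layers.append(nn.Linear(width, 1))  # final dense layer predicts the price
        self.mlp = nn.Sequential(*layers)

    def forward(self, categorical, numeric):
        parts = [emb(categorical[:, i]) for i, emb in enumerate(self.embeddings)]
        x = torch.cat(parts + [numeric], dim=-1)
        return self.mlp(x).squeeze(-1)

model = PriceRegressor(cardinalities=[50, 400, 3, 5], n_numeric=4)
cats = torch.randint(0, 3, (8, 4))   # batch of 8 ads, 4 categorical features
nums = torch.randn(8, 4)             # 4 scaled numerical features
out = model(cats, nums)
print(out.shape)  # torch.Size([8])
```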
2.3.1. Neural Network with Hot-Encoded Add-On Projection
In this case (see Figure 17), the add-ons were hot-encoded to account for their presence or absence. We encoded them with −1 for their absence and 1 for their presence, since these values were forwarded to a dense layer with a tanh activation function.
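A minimal sketch of this encoding and projection, with a hypothetical four-item add-on vocabulary and an illustrative projection size:

```python
import torch
import torch.nn as nn

def encode_addons(addons_present, vocabulary):
    """Encode add-ons as +1 (present) / -1 (absent) over the full vocabulary."""
    return torch.tensor(
        [[1.0 if a in addons_present else -1.0 for a in vocabulary]]
    )

vocabulary = ["sunroof", "heated_seats", "navigation", "tow_hook"]
# Dense layer with tanh activation, matching the ±1 encoding range.
projection = nn.Sequential(nn.Linear(len(vocabulary), 8), nn.Tanh())

x = encode_addons({"sunroof", "navigation"}, vocabulary)
print(x)                    # tensor([[ 1., -1.,  1., -1.]])
print(projection(x).shape)  # torch.Size([1, 8])
```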
2.3.2. Neural Network with Mean Add-On Learned Embeddings
In order to better learn a representation for each add-on and a representation of what the absence of that option meant, add-ons were aggregated using trainable embeddings (see Figure 18). For each potential add-on, two embeddings were computed: one to represent its presence and another to represent its absence. As a result, each vehicle was allocated an equal number of attributes pertaining to add-ons. The average of these embeddings was then considered an aggregation of the vehicle’s characteristics.
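This presence/absence embedding scheme can be sketched as follows; the embedding dimension and the layout of the embedding table are illustrative assumptions:

```python
import torch
import torch.nn as nn

class MeanAddonEmbedding(nn.Module):
    """Two trainable embeddings per add-on (even index: absent, odd index:
    present), averaged over the full add-on vocabulary so every car yields
    a fixed-size representation."""

    def __init__(self, n_addons, emb_dim=16):
        super().__init__()
        # One (absent, present) embedding pair for each potential add-on.
        self.table = nn.Embedding(n_addons * 2, emb_dim)
        self.n_addons = n_addons

    def forward(self, presence):  # presence: (batch, n_addons) in {0, 1}
        base = torch.arange(self.n_addons, device=presence.device) * 2
        indices = base + presence               # pick absent/present embedding
        return self.table(indices).mean(dim=1)  # aggregate over all add-ons

layer = MeanAddonEmbedding(n_addons=120)
presence = torch.randint(0, 2, (8, 120))
out = layer(presence)
print(out.shape)  # torch.Size([8, 16])
```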
2.3.3. Neural Network with Add-On Embeddings and Multi-Head Self-Attention
Another approach, similar to the previous method, involved learning a contextualized representation of these add-ons (see Figure 19), which accounted for more relations than averaging alone. A self-attention layer was applied after computing the embeddings. The multi-head self-attention determined how much each individual add-on contributed to the representation of the other add-ons, facilitating the identification of relevant interactions and capturing long-range dependencies. Self-attention enabled the network to focus adaptively on different parts of the input. After the self-attention was applied, the output was averaged to obtain an aggregated representation.
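A sketch of this variant, reusing the presence/absence embedding idea with a multi-head self-attention layer before mean pooling; all dimensions and the number of heads are illustrative assumptions:

```python
import torch
import torch.nn as nn

class AttentiveAddonEmbedding(nn.Module):
    """Add-on embeddings contextualized by multi-head self-attention and
    then mean-pooled into one aggregated vector."""

    def __init__(self, n_addons, emb_dim=16, heads=4):
        super().__init__()
        self.table = nn.Embedding(n_addons * 2, emb_dim)  # absent/present pairs
        self.attention = nn.MultiheadAttention(emb_dim, heads, batch_first=True)
        self.n_addons = n_addons

    def forward(self, presence):  # presence: (batch, n_addons) in {0, 1}
        base = torch.arange(self.n_addons, device=presence.device) * 2
        x = self.table(base + presence)   # (batch, n_addons, emb_dim)
        x, _ = self.attention(x, x, x)    # each add-on attends to all others
        return x.mean(dim=1)              # aggregate the contextual vectors

layer = AttentiveAddonEmbedding(n_addons=120)
out = layer(torch.randint(0, 2, (8, 120)))
print(out.shape)  # torch.Size([8, 16])
```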
2.3.4. Deep Neural Networks for Image Analysis and Add-On Multi-Head Self-Attention
In order to use all available information in the dataset, we adopted a comprehensive approach by incorporating numerical, categorical, and image attributes. The previously described features were aggregated using the above experimental method, the neural network with add-on embeddings and multi-head self-attention. Additionally, we integrated information about the vehicle’s image into our model. To achieve this, the image was reshaped into a three-dimensional array of dimensions (224, 224, 3), enabling us to capture its visual characteristics. Subsequently, we employed a pre-trained convolutional neural network (CNN) architecture to extract contextualized information from the image. The resulting image projection was combined with the numerical projection obtained from the previous features. The concatenated attributes were then passed through a stack of dense layers, enabling the network to learn complex relationships and patterns within the data. Finally, the prediction was computed based on the processed features, resulting in a comprehensive and informed output based on both the declared car characteristics and images provided in the ad.
For the purpose of this approach, we experimented with a pre-trained convolutional neural network architecture, namely EfficientNet [21], and a transformer-based architecture, the Swin transformer [25]. Both architectures were selected based on their proven state-of-the-art results on various computer vision tasks and their transfer-learning ability. Although both were originally designed for classification tasks, we removed the classification heads and used the last hidden states as a representation of the image characteristics; this representation was then fed into our downstream task of price regression. Although the weights of the pre-trained networks were initialized with their published values, we kept all layers trainable to allow the network to dynamically adjust and adapt to our specific task.
Figure 20 presents a detailed visual representation of this architecture.
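The fusion of the image and tabular branches can be sketched as follows; the tiny CNN stands in for the pre-trained EfficientNet/Swin backbone (whose published weights would be loaded and kept trainable in practice), and all dimensions are illustrative:

```python
import torch
import torch.nn as nn

# Stand-in for the pre-trained backbone; a tiny CNN keeps the sketch
# self-contained and runnable without downloading weights.
backbone = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3, stride=2, padding=1),
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),                       # last hidden state as image features
)

class ImageAndTabularRegressor(nn.Module):
    def __init__(self, backbone, img_dim=8, tab_dim=32):
        super().__init__()
        self.backbone = backbone
        self.head = nn.Sequential(       # dense stack over the fused features
            nn.Linear(img_dim + tab_dim, 64), nn.ReLU(), nn.Linear(64, 1)
        )

    def forward(self, image, tabular):
        img = self.backbone(image)                    # contextual image features
        fused = torch.cat([img, tabular], dim=-1)     # concatenate both branches
        return self.head(fused).squeeze(-1)           # predicted price

model = ImageAndTabularRegressor(backbone)
images = torch.randn(4, 3, 224, 224)    # images reshaped to (224, 224, 3)
tabular = torch.randn(4, 32)            # numerical/categorical/add-on projection
preds = model(images, tabular)
print(preds.shape)  # torch.Size([4])
```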