1. Introduction
The real estate market has a significant economic impact and a broad range of social effects. It plays a crucial role in the global economy by fulfilling people’s fundamental housing needs. Additionally, it promotes economic growth, creates job opportunities, and enhances social well-being [1,2,3]. In recent years, second-hand housing has accounted for a significant proportion of real estate transactions. In 2023, second-hand housing accounted for 89%, 83%, and 74% of the total volume in the United States, the United Kingdom, and France, respectively, and for 40%, 36%, and 42% in China, South Korea, and Japan, respectively. Second-hand housing therefore constitutes an integral part of the housing market. Such homes are typically located in relatively mature communities with complete supporting facilities, which gives homebuyers more choices [4,5]. However, influenced by location and price, the sales cycle of second-hand housing varies considerably. As a result, residents are often unable to assess the sales cycle accurately and swiftly when selling or purchasing, which can result in failed transactions. Consequently, studying the sales cycle of second-hand housing has both practical and theoretical value.
From the perspective of the research object, in the field of second-hand housing transactions, the current research mainly focuses on the prediction of second-hand housing prices, and there are few studies on the sales cycle. Additionally, most studies do not thoroughly identify the factors influencing second-hand housing. Instead, they often focus solely on individual attributes and regional characteristics, neglecting the effects of the broader macro environment and market performance [
6]. Based on the data from 38,363 second-hand housing transactions in Chengdu, China, collected by Ke.com, Zhang et al. [
7] used a random forest model to predict the price of second-hand housing, and the error of the model was 2.61%. However, their research focuses on the attributes of second-hand housing, the indicators involved are relatively limited, and the diversity and comprehensiveness of indicators still need to be further expanded. Gao et al. [
8] incorporated the housing location factor into their study and established a multi-task learning (MTL) model. This model offers significant advantages over the traditional model, yet it still focuses primarily on the characteristics of the second-hand house and its location. Li et al. [
9] identified several influential factors for their research, including the basic characteristics of the housing, the structure information, the community environment, and the supporting facilities within the community. These factors were utilized to assess transactions involving second-hand homes. Lu et al. [
10] combined various regression models to predict house prices, and the estimates of each regression model were averaged to produce the final result. However, the factors they considered were limited to the number of bedrooms, the size of the house, and the available amenities, among others. The research conducted by Park and Bae [
11] examined only the physical characteristics and supporting facilities of houses. In their findings, the selection of influencing factors was overly narrow, relying on the traditional hedonic pricing theory. This approach focused on three main factors: the physical attributes of the housing, location characteristics, and the neighborhood environment.
Unlike the aforementioned scholars, DiPasquale and Wheaton [
12] incorporated macroeconomic variables into their analysis to predict housing prices. They discovered that these macroeconomic factors can enhance the accuracy of housing price predictions. Additionally, Zhong et al. [
13] underscored the strong connection between macroeconomic factors and the real estate market, noting that this relationship influences the market at various levels. Zhang [
14] pointed out that traditional house price prediction models typically rely on numerical attributes, such as the number of rooms, but overlook the textual descriptions of the houses. To address this, the author collected descriptive information from second-hand houses in five cities across Ontario, Canada, and developed a deep-learning model. The model achieved an accuracy of 0.7904, indicating that there is still room for improvement. The research conducted by He and Xia [15] highlights the significance of individual psychological and emotional factors in the real estate market at the micro level. While these factors are challenging to quantify, they are crucial in understanding the second-hand housing market. Additionally, Gdakowicz et al. [16] pointed out that various internal and external factors interact throughout the process, from the initial listing of houses to the final transaction, ultimately influencing the duration of the sales cycle. These studies suggest that a variety of macro and micro factors need to be considered to ensure the comprehensiveness and accuracy of the prediction results.
The Stimulus–Organism–Response (SOR) model, proposed by Mehrabian and Russell, offers a comprehensive explanation of the mechanisms behind consumer purchasing behavior. In this model, ‘Stimulus’ (S) refers to the external environmental factors influencing consumers’ cognition and emotions. ‘Organism’ (O) serves as a mediating variable, representing the internal state of the consumers. Finally, ‘Response’ (R) indicates the external behavioral reactions of consumers, which are typically expressed as either approaching or avoiding certain stimuli. The framework suggests that various stimuli influence consumers’ purchasing behavior. These stimuli generate motivation, which drives consumers to make purchases. After purchase, consumers evaluate the products and the purchasing channels and manufacturers involved [
17]. Considering second-hand housing transactions as a consumer purchasing behavior, the SOR theory is applicable. The SOR theory is introduced in this study to identify factors affecting the sales cycle of second-hand housing. In this context, ‘Stimulus’ refers to the macro and external environmental factors that affect the second-hand housing market. ‘Organism’ pertains to the characteristics of the second-hand house and its surrounding area. Finally, ‘Response’ represents the overall reaction of the second-hand housing market to these stimuli and organisms.
To the best of our knowledge, no existing study proposes a quantitative method that both accurately predicts the sales cycle and comprehensively identifies its influencing factors. We therefore focus our research on the sales cycle of second-hand houses to address this gap.
From the perspective of research methods, quantitative methods for studying the sales cycle are currently lacking. Therefore, the keys to our research are collecting appropriate data, characterizing the nonlinear relationships among influencing factors with suitable methods, and predicting the sales cycle accurately and quickly. The real estate transaction period is lengthy, and obtaining the necessary data can be challenging. Traditional data collection methods mainly rely on statistical data or personal surveys [
18]. Roy [
19] used statistical data compiled by government agencies to conduct an in-depth study of urban housing demand. Feng et al. [
20] collected the original data of commercial residential buildings in Beijing through field investigations. Although the above two methods of obtaining data ensure the authority of the data and the richness of the details, they are time consuming and labor intensive. With the development of science and technology, the online real estate platform provides us with new ideas [
21]. Globally, online real estate platforms, such as Lianjia in China, Zillow in the United States, PropertyGuru in Southeast Asia, and 99acres in India, are popular, offering services such as home purchases, sales, and rentals while working on data reliability and immediacy [
22]. A Web crawler is a program that simulates browser behavior to obtain data from the Internet automatically. It can effectively overcome the limitations of traditional data collection methods, reduce costs, and improve the efficiency of data collection [
23]. In the field of real estate, more and more researchers are using Web crawler technology to obtain relevant data on online platforms. Lee et al. [
24] crawled the public information data provided by a real estate portal and predicted the value of a house accordingly. Yao et al. [
25] collected housing price data from China’s largest online real estate market and generated the spatial distribution of housing prices in Shenzhen, China, with a spatial resolution of 5 m. The research data of Song and Ma [
26] were crawled on Lianjia.com, while geocoding was performed in the Gaode Map API with physical addresses to correlate with other geographic data spatially. Kang et al. empirically analyzed housing prices based on crawled Web data [
6]. Based on these research findings, this study employs crawler technology to collect data aimed at improving the accuracy of sales cycle prediction.
Some studies utilize statistical methods for predictions. For instance, Lancaster et al. [
27] introduced multiple regression analysis in the field of real estate price evaluation. Pior et al. [
28] combined Geographic Information System (GIS) technology to assess the prices of second-hand housing. In their research, Xu and Zhang employed multi-scale geographically weighted regression (MGWR) [
29]. Recent studies have found that the performance of machine learning models is superior to traditional statistical methods, being able to not only deal with complex nonlinear relationships but also having higher computational efficiency [
30,
31]. Adetuji et al. [
32] established a random forest model to predict housing prices, and the error range was controlled within ±5%, showing the advantages of the machine learning model in prediction accuracy. Xu and Zhang [
33] used neural network models to predict housing prices, further demonstrating the potential of machine learning methods in efficiency and accuracy. Zhan et al. [
34] combined Bayesian optimization with ensemble learning technology to bring new solutions to housing price forecasting. Ge et al. [
35] compared artificial neural networks (ANNs) with logistic regression (LR) models and found that an ANN performed significantly better than LR in dealing with nonlinear relationships, which further confirmed the superiority of machine learning models in prediction problems. Although machine learning models are superior to statistical methods in terms of prediction efficiency and accuracy, the learning speed of some models is usually much slower than the required speed. Huang et al. [
36] noted this and proposed a new learning algorithm called the extreme learning machine (ELM), which can provide good generalization performance with an extremely fast learning speed. The kernel extreme learning machine (KELM) builds on the ELM by replacing the activation function of the hidden layer with a kernel function. In practical applications, the number of neurons in the hidden layer, the input parameters, and the system bias therefore do not need to be specified, and its generalization ability and learning speed are better than those of the ELM [37]. Because the KELM is highly effective for problems with multiple inputs and complex nonlinear relationships but relies strongly on the choice of kernel function, this paper linearly weights multiple kernel functions to establish a hybrid kernel extreme learning machine (HKELM). The model is intended to predict the sales cycle of second-hand houses.
The HKELM has multiple hyperparameters that need to be optimized. At present, the mainstream optimization algorithms are the GA [
38], QPSO [
39], SSA [
40], WOA [
41], GWO [
42], BWO [
43], HHO [
44], DBO [
45], etc. These algorithms and their variants are widely used in various engineering problems and parameter optimization. The Crested Porcupine Optimizer (CPO) was proposed by Abdel-Basset et al. [
46]. It has been tested in several optimization problems. Compared with the most advanced heuristic algorithms and traditional methods, it has strong competitiveness and broad application prospects. Zang et al. [
47] used the CPO to optimize the BiTCN-BiLSTM-SA model and predicted optical fiber network traffic based on this. Gao et al. [
48] used the CNN-BiLSTM-Attention model optimized using the CPO to predict the temperature of a bridge girder. However, some scholars have pointed out that the CPO has problems such as slow convergence and insufficient optimization performance [
49] and have improved the CPO. Zhang et al. [
50] deleted the population reduction mechanism and improved the update formula of the defense phase. Based on this, the LSTM model was optimized to predict the deformation of soft rock tunnels. Wang et al. [
51] introduced chaotic mapping to initialize the population to improve the CPO and optimized the DV-Hop algorithm based on this. Liu et al. [
52] enhanced the CPO based on Cauchy mutation, adaptive weights, and other strategies. The above improvements have achieved good results. At the same time, we noticed that Ling et al. [
53] introduced the Lévy flight strategy when improving the whale optimization algorithm (WOA), which effectively improved the algorithm’s convergence speed and global optimization ability. Wang et al. [
54] introduced the tangent flight operator when improving the Harris Hawk optimizer (HHO), which is innovative. Based on the above research results, we propose the following improvements to the CPO:
Using a Cubic map to initialize the population;
Deleting cycle population reduction;
Introducing a hybrid mutation strategy;
Introducing a tangent flight strategy.
Therefore, this paper adopts the CMTCPO-HKELM method to study the sales cycle of second-hand houses. The main contributions are as follows: (1) The SOR model is used for the first time to analyze the factors influencing the second-hand housing sales cycle, which provides a new perspective for research in this field. (2) This paper presents an improved Crested Porcupine Optimizer, which has better computational performance. Based on this optimizer, this paper also develops a novel prediction model for the second-hand housing sales cycle that achieves rapid and accurate prediction.
The remainder of this article is organized as follows: Section 2 describes the proposed research methods in detail. Section 3 presents case studies of eight cities in China with different economic conditions. Section 4 discusses the research in detail. Section 5 summarizes the research content and limitations of this paper.
2. Materials and Methods
2.1. Influencing Factors of the Sales Cycle of Second-Hand Housing
2.1.1. Factors Related to Stimulus (S)
The stimulus characterizes the external environmental factors of the second-hand housing market, which profoundly affect the sales cycle of second-hand housing. The factors related to the stimulus can be divided into three categories: policy, economy, and market supply and demand.
1. Policy factors:
Policies implemented by the government primarily encompass monetary policy and tax policy. They intervene in the real estate market at the transaction level, triggering fluctuations in house prices [
55,
56,
57]. In second-hand housing transactions, buyers focus on purchase costs, so policy factors significantly influence the second-hand housing market. Relevant research shows that credit policies include loan interest rates, the determination of down payment ratios, and restrictions on loan terms and economic conditions. When credit policies are favorable to homebuyers, people’s willingness to buy significantly improves [
58,
59,
60]. At the same time, tax fluctuations also impact house prices [
61]. In this study, we selected ‘loan rate’ and ‘down payment ratio’ to characterize credit policy and ‘tax ratio’ to characterize tax policy.
2. Economic factors:
Economic factors stimulate the second-hand housing market in two ways: regional economic conditions and buyers’ purchasing power. Relevant research points out that the real estate market is obviously affected by economic factors, and the downward trend of regional economic conditions has dramatically weakened the purchasing power of consumers [
62]. On the one hand, consumers’ desire to buy a house depends on their own housing needs and future purchase plans. On the other hand, it also depends on consumers’ actual disposable funds and consumption capacity. In this study, ‘regional GDP growth rate’ is selected to characterize the regional economic situation, and ‘per capita disposable income’ and ‘unemployment rate’ are selected to characterize the consumption ability of buyers.
3. Market supply and demand:
Second-hand housing transactions are often affected by the stock of new housing and the supply-and-demand relationship of second-hand housing. When the supply of new homes significantly affects the housing stock, the price of new homes will affect the price of existing homes [
63]. At the same time, the supply and demand of second-hand housing can be imbalanced: a surge in listings over a period of time can stall second-hand housing transactions. In this study, ‘new house sales cycle’ is selected to characterize the stock of new houses, and ‘second-hand house listing volume’ and ‘second-hand house trading volume’ are selected to characterize the supply-and-demand relationship of second-hand houses.
2.1.2. Factors Related to Organism (O)
The organism characterizes the internal state of the second-hand housing market, that is, the attributes and additional characteristics of second-hand housing. According to the hedonic pricing theory, the factors associated with the organism can be divided into three distinct classes: housing characteristics, location attributes, and supporting facilities.
1. Housing characteristics:
When consumers purchase goods, they are satisfied by consuming and enjoying the characteristic utility of the goods, so the characteristic attribute of the goods determines the price. In the second-hand housing market, housing characteristics affect the second-hand housing transaction. It is noted that relevant research on housing prices usually uses factors such as ‘building area’, ‘house age’, ‘the number of living rooms’, ‘the number of bedrooms’, ‘the number of toilets’, ‘structural form’, ‘the total number of floors’, ‘floor number’, ‘orientation’, and ‘decoration degree’ to establish a prediction system. This study uses the above factors to characterize housing characteristics.
2. Location attributes and supporting facilities:
As a key element, the surrounding environment of the house is closely related to human behavior patterns and experiences [
64]. Second-hand houses with a superior geographical location are usually located in areas with convenient transportation and perfect supporting facilities. They can provide buyers convenient commuting conditions, high-quality educational resources, rich commercial facilities, and good medical services. These factors have an important impact on the living experience and the value of second-hand houses. Some scholars have introduced service accessibility, traffic accessibility, accessibility of high-quality educational resources, accessibility of shopping centers, greening rate, and other factors to establish an index system when studying second-hand housing transactions [
8,
9,
10]. Therefore, this paper chooses ‘distance from the city center’, ‘distance from the subway station’, ‘the number of educational resources’, ‘greening rate’, and ‘floor area ratio’ to characterize the location attributes and supporting facilities of second-hand houses.
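Distance-based indicators such as ‘distance from the city center’ and ‘distance from the subway station’ can be derived from the coordinates returned by a map service. The following is a minimal sketch, assuming that a straight-line (great-circle) distance is an acceptable proxy; the text does not state whether straight-line or route distance is used, and the function name and coordinates are hypothetical.

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points in degrees."""
    radius_km = 6371.0  # mean Earth radius
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * radius_km * math.asin(math.sqrt(a))

# Hypothetical coordinates, for illustration only (not taken from the study's data).
house = (31.2400, 121.4900)
city_center = (31.2304, 121.4737)
print(round(haversine_km(*house, *city_center), 2), "km")
```

A route-based distance from a map API would be a drop-in replacement if commuting distance rather than straight-line distance is intended.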
2.1.3. Factors Related to Response (R)
The response characterizes the overall reaction of the second-hand housing market to stimuli and organisms. Factors related to the response (R) can be divided into two groups: housing price and market performance.
1. Housing price:
The price of second-hand housing is divided into the listing price and transaction price. A change in the listed price significantly affects the trading results [
65]. A higher listing price may reduce the purchase intention of buyers, and a reasonable listing price may attract more potential buyers. The transaction price also affects the seller’s pricing strategy and the buyer’s purchase intention. A decline in the transaction price may encourage more buyers to enter the market, and an increase in the transaction price indicates that the market competition is fierce, and sellers are more confident in holding the property in anticipation of higher returns. The rise and fall of housing prices directly affect the activity of housing transactions and market expectations. The sharp rise in house prices may increase buyers’ sense of urgency, and the expectation of further appreciation of the value of real estate prompts them to speed up their purchase decisions. When house prices begin to fall, buyers may wait and see, expecting prices to fall further [
66]. Therefore, this study uses ‘listed price’, ‘transaction price’, and ‘the rise and fall of house prices’ to characterize housing prices.
2. Market performance:
The market performance of second-hand housing is closely related to the average listing time, the number of viewings, and the rate of change in trading volume. There is a correlation between the average listing time and the price of real estate transactions [67]. A shorter listing time may indicate that the property is of high quality or reasonably priced. In comparison, a longer listing time may mean that the property is less attractive or overpriced [68]. The number of viewings reflects the degree of interest of potential buyers in the property. Frequent viewings may increase a house’s exposure and thus the chance of a transaction, whereas a low viewing frequency may mean that the property’s attractiveness in the market is limited [69]. The volume change rate refers to the percentage increase or decrease in real estate transaction volume over a certain period. An increase in trading volume may indicate that the market is active and demand is strong, while a decrease may indicate that the market is depressed and demand is weak. Therefore, this study selects ‘average listing time’, ‘number of times to watch’, and ‘volume change rate’ to characterize the market performance.
Based on the above analysis, we have designed a forecasting index system for the sales cycle of second-hand housing, including three primary and thirty-three secondary indexes. O6, O8, O9, and O10 are qualitative, while the others are quantitative. The specific details are shown in
Table 1. It should be noted that the floor number represented by the influencing factor O8 is private data. To protect customers’ privacy, the real estate platform only indicates whether the house is located on a low, middle, or high floor, so O8 is classified as a qualitative indicator.
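Qualitative indicators have to be converted to numbers before they can be fed to a prediction model. The sketch below illustrates one possible encoding, under the assumption that O8 (floor level) is treated as ordinal and O9 (orientation) as nominal; the integer codes, the category list, and the function names are hypothetical and are not taken from the study.

```python
# Ordinal codes for O8; the three levels come from the text, the codes are assumed.
FLOOR_LEVEL_CODES = {"low": 0, "middle": 1, "high": 2}

# Hypothetical category list for O9 (orientation).
ORIENTATIONS = ["north", "south", "east", "west"]

def encode_floor_level(level):
    """Map a floor-level label to an ordinal code."""
    return FLOOR_LEVEL_CODES[level.strip().lower()]

def one_hot(value, categories):
    """One-hot encode a nominal value against a fixed category list."""
    if value not in categories:
        raise ValueError(f"unknown category: {value!r}")
    return [1 if value == c else 0 for c in categories]

print(encode_floor_level("High"), one_hot("south", ORIENTATIONS))
```

An ordinal code preserves the low/middle/high ordering, whereas one-hot encoding avoids imposing an artificial order on categories such as orientation.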
2.2. Relevant Data Acquisition Methods
In this study, the secondary index data of ‘stimulus’ (S) and ‘response’ (R) mainly come from official government websites and authoritative statistical yearbooks, which are highly reliable. The secondary index data of ‘organism’ (O) are collected automatically from the Internet through Web crawler technology, which keeps them timely. Specifically, on the real estate information platform Lianjia (https://nc.lianjia.com/, accessed on 22 November 2024), we collected data from 400 second-hand housing transactions for eight representative cities: Shanghai, Shenzhen, Guangzhou, Chengdu, Qingdao, Nanchang, Luoyang, and Guilin. These data cover all aspects related to the house’s property, including but not limited to key information such as the area, price, construction age, and house decoration style. In addition, the information about the location attributes and supporting facilities of the house is obtained from the Gaode map (https://ditu.amap.com/, accessed on 7 December 2024), which covers the geographical location, traffic conditions, educational resources, medical facilities, and other aspects of the house.
The eight selected cities are broadly representative and in line with China’s basic national conditions. Economically, they span first-tier, second-tier, and third-tier cities. Geographically, Qingdao, Nanchang, and Shanghai are located in East China, Shenzhen, Guangzhou, and Guilin are located in South China, Luoyang is located in Central China, and Chengdu is located in Southwestern China. Politically, Shanghai, Guangzhou, Nanchang, and Chengdu are municipalities or provincial capitals, and the rest are prefecture-level cities.
Figure 1 shows the eight cities we chose.
2.3. The Predictive Model Proposed in This Research
2.3.1. Hybrid Kernel Extreme Learning Machine
An extreme learning machine (ELM) has the advantages of a simple structure, easy implementation, and fast calculation speed. It determines the parameters between the input layer and the hidden layer via random generation and then uses the Moore–Penrose generalized inverse to calculate the optimal parameters between the hidden layer model and the output layer model [
70]. In some engineering backgrounds, the performance of the ELM has been proven to be superior to the traditional BP neural network and SVM model [
36].
The ELM’s input parameters are randomly generated and fixed, so there is no need for an iterative solution. It only needs to calculate the parameters between the hidden and output layers, greatly improving the model’s calculation speed.
Figure 2 shows the basic structure of the ELM.
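Given a data set $\{(x_i, t_i)\}_{i=1}^{N}$ containing $N$ samples, let the sample $x_i$ have d features, and let $t_i$ be the sample label. The single-layer feedforward neural network (SLFN) constructed by the ELM can be converted into Equation (1) to solve [36]:
$$\sum_{j=1}^{L} \beta_j \, g(w_j \cdot x_i + b_j) = t_i, \quad i = 1, 2, \dots, N \tag{1}$$
where $L$ is the number of hidden layer neurons, $\beta_j$ is the weight between the jth hidden layer neuron and the output layer neuron, $w_j$ is the weight between the input layer neuron and the jth hidden layer neuron, $b_j$ is the bias term, and $g(\cdot)$ is the activation function.
For the convenience of calculation, Equation (1) can be expressed in the form of a matrix [36]:
$$H\beta = T \tag{2}$$
where $H$ is the hidden layer output matrix with entries $H_{ij} = g(w_j \cdot x_i + b_j)$, $\beta = [\beta_1, \dots, \beta_L]^{\mathsf{T}}$, and $T = [t_1, \dots, t_N]^{\mathsf{T}}$.
By using the Karush–Kuhn–Tucker (KKT) condition and singular value decomposition (SVD) to solve the optimization problem of Equation (2), we can obtain [36]:
$$\hat{\beta} = H^{\dagger} T$$
where $H^{\dagger}$ is the Moore–Penrose generalized inverse matrix of $H$.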
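The training procedure described above can be illustrated with a short NumPy sketch: the input weights and biases are drawn once at random and kept fixed, and only the output weights are computed through the Moore–Penrose generalized inverse. This is an illustrative sketch rather than the implementation used in this study; the sigmoid activation, the Gaussian weight distribution, the toy data, and all function names are assumptions.

```python
import numpy as np

def hidden_output(X, W, b):
    """Hidden layer output matrix H with a sigmoid activation (assumed)."""
    return 1.0 / (1.0 + np.exp(-(X @ W + b)))

def elm_fit(X, T, n_hidden, rng):
    """Train an ELM: random fixed input weights, output weights via pseudoinverse."""
    W = rng.standard_normal((X.shape[1], n_hidden))  # input-to-hidden weights, not trained
    b = rng.standard_normal(n_hidden)                # hidden biases, not trained
    H = hidden_output(X, W, b)
    beta = np.linalg.pinv(H) @ T                     # output weights: pseudoinverse of H times T
    return W, b, beta

def elm_predict(X, W, b, beta):
    return hidden_output(X, W, b) @ beta

# Toy regression data, for illustration only.
rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(50, 2))
T = np.sin(X[:, 0]) + X[:, 1] ** 2
W, b, beta = elm_fit(X, T, n_hidden=40, rng=rng)
train_mse = float(np.mean((elm_predict(X, W, b, beta) - T) ** 2))
print(f"training MSE: {train_mse:.2e}")
```

Because no iterative weight update is involved, the cost of training is dominated by a single pseudoinverse computation.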
In order to make the generalization ability of the ELM stronger, Huang et al. added the regularization coefficient
into Equation (2) and added
as the penalty term to obtain the following problem [
71]:
Huang et al. proposed the kernel extreme learning machine (KELM) model by using the kernel function instead of the activation function of the hidden layer [
37]. It can effectively avoid the dimension disaster generated by the traditional ELM and improve the generalization ability [
37]:
where
is the kernel matrix, and
is the kernel function.
At this point, the solution to Equation (2) is [
71]:
The regularization coefficient and the kernel parameter determine the performance of the KELM. The regularization coefficient improves the model’s generalization ability, and the kernel parameter determines the effect of sample mapping on the high-dimensional feature space. Common kernel functions include the Radial Basis Function (RBF), polynomial kernel function, and linear kernel function. The RBF has good local learning ability and poor prediction for samples beyond a certain range. On the contrary, the polynomial kernel function, as a global kernel function, can predict samples in a global range. Therefore, this study combines the RBF and polynomial kernel functions to construct a hybrid kernel extreme learning machine (HKELM) model.
The expression of the hybrid kernel function is:
where
is the weight,
is the parameter of the RBF, and
is the parameter of the polynomial kernel function.
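As an illustration of how a local and a global kernel can be combined, the sketch below forms a weighted sum of an RBF kernel and a polynomial kernel. It assumes a convex combination with a single weight, an RBF of the form exp(−‖x − y‖²/(2σ²)), and a polynomial kernel of the form (x·y + c)^p with c fixed to 1; these parameterizations and the function names are assumptions and may differ from the expression used in this paper.

```python
import numpy as np

def rbf_kernel(X, Y, sigma):
    """RBF kernel exp(-||x - y||^2 / (2 sigma^2)) for every pair of rows."""
    sq_dist = (
        np.sum(X ** 2, axis=1)[:, None]
        + np.sum(Y ** 2, axis=1)[None, :]
        - 2.0 * X @ Y.T
    )
    return np.exp(-np.maximum(sq_dist, 0.0) / (2.0 * sigma ** 2))

def poly_kernel(X, Y, degree, c=1.0):
    """Polynomial kernel (x . y + c)^degree; c = 1 is an assumed constant."""
    return (X @ Y.T + c) ** degree

def hybrid_kernel(X, Y, weight, sigma, degree):
    """Weighted sum: weight * RBF + (1 - weight) * polynomial, with 0 <= weight <= 1."""
    return weight * rbf_kernel(X, Y, sigma) + (1.0 - weight) * poly_kernel(X, Y, degree)

X_demo = np.array([[0.0, 0.0], [1.0, 0.0]])
K_demo = hybrid_kernel(X_demo, X_demo, weight=0.5, sigma=1.0, degree=2)
print(K_demo)
```

With weight = 1 the model reduces to a pure RBF KELM and with weight = 0 to a pure polynomial KELM, so the weight, the RBF parameter, and the polynomial parameter are natural targets for hyperparameter optimization.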
Based on Equations (1)–(9), the predicted value
of sample label
is:
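The prediction step can be sketched as follows, assuming the standard closed-form KELM solution in which the coefficients solve (I/C + Ω)α = T and a new sample is predicted from its kernel values against the training samples. The sketch uses a plain RBF kernel to stay self-contained; any kernel with the same call signature, such as a hybrid kernel, could be passed instead. Function names and toy data are hypothetical.

```python
import numpy as np

def rbf(X, Y, sigma=0.3):
    """RBF kernel matrix between the rows of X and the rows of Y."""
    sq_dist = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq_dist / (2.0 * sigma ** 2))

def kelm_fit(X, T, C, kernel):
    """Solve (I / C + Omega) alpha = T, where Omega[i, j] = kernel(x_i, x_j)."""
    omega = kernel(X, X)
    return np.linalg.solve(np.eye(len(X)) / C + omega, T)

def kelm_predict(X_new, X_train, alpha, kernel):
    """Prediction: kernel values against the training set times alpha."""
    return kernel(X_new, X_train) @ alpha

# Toy one-dimensional regression data, for illustration only.
X_train = np.linspace(0.0, 1.0, 20)[:, None]
T_train = np.sin(2.0 * np.pi * X_train[:, 0])
alpha = kelm_fit(X_train, T_train, C=100.0, kernel=rbf)
pred = kelm_predict(X_train, X_train, alpha, rbf)
print(pred[:3])
```

A larger regularization coefficient C fits the training labels more closely, while a smaller C smooths the prediction.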
2.3.2. Crested Porcupine Optimizer
The CPO is a new heuristic algorithm inspired by the defense mechanism of the crested porcupine. When faced with threats, crested porcupines use four strategies: visual defense, auditory defense, olfactory defense, and physical attack to resist intruders. The basic process of the CPO is as follows [
46]:
Step 1: population initialization:
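The CPO is a population-based heuristic algorithm. Each crested porcupine (CP) in the population is a candidate solution that is continuously updated in the search space. Assuming that there are $N$ individuals in the CP population and that the search space is d-dimensional, the position matrix can be expressed using Equation (11) [46]:
$$X = \begin{bmatrix} x_{1,1} & \cdots & x_{1,d} \\ \vdots & \ddots & \vdots \\ x_{N,1} & \cdots & x_{N,d} \end{bmatrix} \tag{11}$$
Each individual $X_i$ can be initialized as Equation (12) [46]:
$$X_i = \mathit{lb} + r \times (\mathit{ub} - \mathit{lb}), \quad i = 1, 2, \dots, N \tag{12}$$
where $\mathit{ub}$ and $\mathit{lb}$ are the upper and lower bounds of the search space of the solution, respectively, and r is a random number between [0, 1].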
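A minimal sketch of this random initialization is given below. It assumes that r is drawn independently for every coordinate of every individual; the function name and the bounds are hypothetical.

```python
import numpy as np

def init_population(n_individuals, lower, upper, rng):
    """Uniform random initialization inside [lower, upper] for each dimension."""
    lower = np.asarray(lower, dtype=float)
    upper = np.asarray(upper, dtype=float)
    r = rng.random((n_individuals, lower.size))  # r in [0, 1), one draw per coordinate
    return lower + r * (upper - lower)

rng = np.random.default_rng(42)
population = init_population(30, lower=[-5.0, 0.0, 1.0], upper=[5.0, 1.0, 10.0], rng=rng)
print(population.shape)
```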
Step 2: cycle population reduction (CPR):
The individuals in the crested porcupine population trigger their defense mechanism after being threatened, while unthreatened individuals live normally. To simulate this process, the CPO introduces a cyclic population reduction technique to reduce the population size [
46]:
where
is the current generation,
is the population size of the
generation,
is the minimum population size,
is the initial population size,
is the maximum number of iterations,
is used to determine the number of cycles, and % is the remainder operation. The change in population size when
= 50 is shown in
Figure 3.
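To make the cyclic behavior concrete, the sketch below computes the population size per generation. It assumes the schedule takes the form N_min + (N′ − N_min) × (1 − (t % (T_max/T)) / (T_max/T)), which is consistent with the quantities defined above, and it truncates the result to an integer; both the exact form and the rounding are assumptions, and the parameter values are illustrative.

```python
def population_size(t, n_init, n_min, t_max, n_cycles):
    """Population size at generation t under an assumed cyclic reduction schedule."""
    cycle_len = t_max / n_cycles          # number of generations in one cycle
    progress = (t % cycle_len) / cycle_len  # position inside the current cycle, in [0, 1)
    return int(n_min + (n_init - n_min) * (1.0 - progress))

# Illustrative values: the size shrinks within a cycle and resets at the next one.
for t in (0, 250, 499, 500):
    print(t, population_size(t, n_init=50, n_min=10, t_max=1000, n_cycles=2))
```

Within each cycle the population shrinks from the initial size toward the minimum size and is restored at the start of the next cycle.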
Step 3: exploration phase:
1. Visual defense:
When the crested porcupine encounters a predator, it erects and waves its sharp spines to warn the predator. This phenomenon can be expressed using Equation (14) [
46]:
where
is a random number obeying the normal distribution,
is a random number between
,
is the position of the individual in the
t-th generation,
is the optimal position of the population in the
t-th generation, and
is the position of the predator in the
t-th generation. It can be defined according to Equation (15) [
46]:
where
is an arbitrary random integer in
.
2. Auditory defense:
The crested porcupine produces noise to warn the predator. This process can be realized by Equation (16) [
46]:
where
is a random number between
,
and
are two random integers between
, and
represents a randomly generated binary vector.
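The two exploration-phase updates can be sketched as follows. The sketch assumes that the predator position is the midpoint between the current individual and a randomly chosen one, that the visual defense moves the individual by a normally scaled distance to the best position, and that the auditory defense mixes the current position with a perturbed predator position through a binary vector. These forms follow our reading of the quantities defined above and of the original CPO description [46]; they are assumptions, and all function and variable names are hypothetical.

```python
import numpy as np

def visual_defense(X, i, x_best, rng):
    """Assumed form: x_i + tau1 * |2 * tau2 * x_best - y|."""
    r = rng.integers(X.shape[0])      # index of a random individual
    y = (X[i] + X[r]) / 2.0           # predator position (midpoint)
    tau1 = rng.standard_normal()      # normally distributed random number
    tau2 = rng.random()               # uniform random number in [0, 1)
    return X[i] + tau1 * np.abs(2.0 * tau2 * x_best - y)

def auditory_defense(X, i, rng):
    """Assumed form: (1 - U1) * x_i + U1 * (y + tau3 * (x_r1 - x_r2))."""
    n, d = X.shape
    r, r1, r2 = rng.integers(n, size=3)  # random individual indices
    y = (X[i] + X[r]) / 2.0
    u1 = rng.integers(0, 2, size=d)      # random binary vector
    tau3 = rng.random()                  # uniform random number in [0, 1)
    return (1 - u1) * X[i] + u1 * (y + tau3 * (X[r1] - X[r2]))

rng = np.random.default_rng(1)
X = rng.uniform(-1.0, 1.0, size=(6, 3))
print(visual_defense(X, 0, X[2], rng))
print(auditory_defense(X, 0, rng))
```

Both updates use randomly chosen individuals rather than only the best position, which is what gives this phase its exploratory character.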
Step 4: exploitation phase:
3. Olfactory defense:
The crested porcupine emits a special smell that prevents the predator from approaching. This process can be realized by Equation (17) [
46]:
where
is a random number between
,
is a random integer between
,
is a randomly generated binary vector,
is an odor diffusion factor defined by Equation (18),
is defined by Equation (19) and is used to control the search direction, and
is a defence factor defined by Equation (20).
Here,
represents the value of the objective function, and
is a minimum value.
is a random vector, where the values are random numbers between
, and
is a random number between
.
4. Physical attack:
When the predator attacks the crested porcupine, the crested porcupine attacks the predator in return. This interaction can be modeled by the physical process of inelastic collision, which can be realized by Equation (21) [
46]:
where
is a random number between
,
is the convergence factor, and
is the average force acting on the
ith predator, which is defined by Equations (22)–(24):
Here,
is the mass of the
ith individual in the
t-th generation,
is the final velocity,
is the initial velocity of the
ith individual in the
t-th generation,
is a vector, and the values are random numbers between
.
Step 5: balance between the exploration phase and the exploitation phase:
In order to balance the exploration phase and exploitation phase of the CPO, random numbers
from
and a constant
from
are introduced. The position update formula is summarized as follows [
46]:
Step 6: global optimal solution update:
The global optimal solution can be updated as follows [
46]:
2.3.3. CMTCPO: An Improved Crested Porcupine Optimizer
The traditional CPO suffers from slow convergence, a tendency to fall into local optima, and insufficient optimization performance. To address these problems, this research proposes an improved crested porcupine optimizer, CMTCPO. The specific improvements are as follows.
- 1.
Use chaos mapping to initialize populations:
In many meta-heuristic algorithms, the initial population is generated randomly from a probability distribution. Although this method is simple, it may lead to insufficient population diversity, which reduces the algorithm’s search efficiency and global optimization ability. To solve this problem, some scholars have introduced chaotic mapping into meta-heuristic algorithms. Chaos refers to the extremely complex and unpredictable dynamic behavior of deterministic dynamic systems that arises from their sensitivity to initial conditions. It is characterized by unpredictability, aperiodicity, and ergodicity [
72]. In recent years, chaotic mapping has been widely adopted in optimization, where it has been shown to maintain population diversity and improve an algorithm’s global search ability [
73].
Table 2 shows 10 common chaotic mapping functions:
are based on a polynomial function,
are based on a trigonometric function,
is based on a complementary function,
is a piecewise function, and
is composed of various chaotic mappings.
Figure 4 shows the distribution of points generated by different chaotic maps. Except for
, the generated points fall within
, which ensures diversity and comprehensiveness in the search process. The points generated by
are evenly distributed and have good mathematical characteristics, but the points generated by some chaotic maps are unevenly distributed, which degrades performance. Therefore, selecting an appropriate chaotic map is very important in practical applications. This study applies each of the above 10 chaotic maps to the CPO and comprehensively compares the resulting improvements. The Cubic chaotic map performs best. The specific process is discussed in
Section 4.
The initialization of the CPO also depends on random generation. Therefore, this study introduces chaotic mapping to improve the CPO, increase the diversity of the population, and improve the search ability of the algorithm. The position
of each CP individual can be initialized using Equation (27):
where
and
are the upper and lower bounds of the search space of the solution, respectively, and
is the chaotic map in
Table 2.
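The chaotic initialization can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the Cubic map is taken in its common form x_{k+1} = ρ x_k (1 − x_k²) with ρ = 2.595, and each chaotic value is mapped linearly between the lower and upper bounds. The function names, the seed handling, and the range of the starting value are hypothetical choices made for illustration.

```python
import random


def cubic_map_sequence(x0, length, rho=2.595):
    """Iterate the Cubic chaotic map x <- rho * x * (1 - x^2), starting from x0."""
    seq = []
    x = x0
    for _ in range(length):
        x = rho * x * (1.0 - x * x)
        seq.append(x)
    return seq


def chaotic_init(pop_size, dim, lower, upper, seed=0):
    """Initialize a population by mapping chaotic values onto [lower, upper]."""
    rng = random.Random(seed)
    population = []
    for _ in range(pop_size):
        x0 = rng.uniform(0.05, 0.95)      # hypothetical starting value in (0, 1)
        chaos = cubic_map_sequence(x0, dim)
        population.append(
            [lower[j] + chaos[j] * (upper[j] - lower[j]) for j in range(dim)]
        )
    return population


# Hypothetical three-dimensional search space.
lower, upper = [0.0, -5.0, 10.0], [1.0, 5.0, 20.0]
pop = chaotic_init(6, 3, lower, upper)
```

With ρ = 2.595, the iterates of this map stay inside (0, 1) for any starting value in (0, 1), so every initialized position lies within the bounds.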
- 2.
Remove the cyclic population reduction technique:
The CPO’s cyclic population reduction technique shrinks the search space during certain periods. In this study, this technique is removed and the population size is kept constant to prevent the loss of potential optimal solutions.
- 3.
Hybrid mutation strategy:
As the number of iterations increases, the traditional CPO easily falls into a local optimum in the later stage and converges slowly. This study proposes a hybrid mutation strategy to address this problem.
Convergence factor
is defined to judge whether the population is in the early or late stages of iteration. Let
be any positive integers and
be the maximum number of iterations; then,
changes according to Equation (28):
Figure 5 shows the changes of
under different parameter conditions. When
,
has very good properties: it decreases slowly. It has a large value in the first 500 generations and decreases rapidly to a very small value in the latter 500 generations. At this time, if
, it indicates that the algorithm is in the late iteration; otherwise, it is in the early iteration. Therefore, this study’s parameter for the convergence factor is
.
Step 1: differential evolution strategy:
The basic process of differential evolution is similar to that of the Genetic Algorithm (GA), including mutation, crossover, and selection. This strategy helps counter the loss of search vitality in the later stages of meta-heuristic iteration. Our study introduces this strategy into the first and third defense stages to improve the CPO’s convergence speed.
When , let be the original population, be the mutated population, and be the population after the crossover operation; then, the following process occurs:
Let
and
denote the scaling factor.
is an arbitrary random number between
, and
be a random number between
, which obeys the standard normal distribution
. In the
t-th generation, if a CP individual’s position is
, four individuals,
, and
, are randomly selected from the original population. The individual
can be represented by Equation (29) [
74]:
The crossover coefficient CR determines whether the individuals in the population perform the crossover operation. It can be defined as:
Let
be an arbitrary random number between
. When
is less than or equal to the crossover coefficient
, the population performs a crossover operation. The individual
in the population
can be expressed as [
74]:
Individual
in the original population and individual
in the crossover population are selected according to the greedy algorithm, and the individual with better fitness is chosen as the next-generation individual [
74]:
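The mutation, crossover, and greedy selection steps can be sketched in Python as follows. This sketch is not the exact scheme of Equations (29)–(32): it uses the classic DE/rand/1 mutation with three randomly selected individuals and binomial crossover, whereas this study selects four individuals. The function name, the default scaling factor and crossover coefficient, and the sphere objective are hypothetical.

```python
import random


def de_step(population, fitness, f_scale=0.5, cr=0.9, rng=None):
    """One generation of classic differential evolution (minimization)."""
    rng = rng or random.Random(0)
    n, dim = len(population), len(population[0])
    next_pop = []
    for i, x in enumerate(population):
        # Mutation: combine three distinct individuals other than x.
        a, b, c = rng.sample([k for k in range(n) if k != i], 3)
        mutant = [
            population[a][j] + f_scale * (population[b][j] - population[c][j])
            for j in range(dim)
        ]
        # Crossover: take the mutant component when a random number <= CR;
        # j_rand guarantees that at least one component comes from the mutant.
        j_rand = rng.randrange(dim)
        trial = [
            mutant[j] if (rng.random() <= cr or j == j_rand) else x[j]
            for j in range(dim)
        ]
        # Greedy selection: keep whichever individual has the better fitness.
        next_pop.append(trial if fitness(trial) < fitness(x) else x)
    return next_pop


def sphere(v):
    """Hypothetical test objective: sum of squares."""
    return sum(c * c for c in v)


rng = random.Random(1)
old_pop = [[rng.uniform(-5, 5) for _ in range(3)] for _ in range(8)]
new_pop = de_step(old_pop, sphere, rng=rng)
```

Because of the greedy selection, no individual in the new population has worse fitness than the individual it replaces.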
Step 2: Cauchy mutation strategy:
Cauchy mutation is widely used to improve meta-heuristic algorithms. The basic idea is to use the Cauchy distribution to perturb the current optimal solution and thereby explore new positions in the search space. Compared with the normal distribution, the Cauchy distribution has a lower peak at the origin but heavier tails, so Cauchy mutation can explore new solutions over a wider range, which helps the algorithm escape from local optima and enhances its global search ability. The probability density function is as follows [
54]:
where
is the position parameter, and
is the scale parameter. When
, the distribution represented by Equation (33) is called the standard Cauchy distribution.
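For reference, the Cauchy probability density function has the following standard textbook form. The symbols x_0 (position parameter) and γ (scale parameter) are assumed notation, since the original symbols are not reproduced in this text.

```latex
\[
f(x;\, x_0, \gamma) = \frac{1}{\pi}\,\frac{\gamma}{(x - x_0)^2 + \gamma^2},
\qquad
f(x;\, 0, 1) = \frac{1}{\pi\,(1 + x^2)} .
\]
```

The second expression is the standard Cauchy density, obtained with x_0 = 0 and γ = 1.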
Therefore, this study introduces the Cauchy mutation strategy as the fourth defense mechanism to increase the CPO’s global search ability in the later stage of iteration, help the algorithm move beyond the local optimal solution, and improve the diversity of solutions.
If
, in the
generation, the position
of the CP individual can be updated using Equation (34):
where
represents a random number that obeys the standard Cauchy distribution.
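A standard Cauchy random number can be drawn by inverse-transform sampling, tan(π(u − 0.5)) with u uniform on [0, 1). The Python sketch below uses it to perturb a position. The update x + x · Cauchy(0, 1) is one form commonly used in the literature and is an assumption here; it is not necessarily Equation (34), and the function names are hypothetical.

```python
import math
import random


def standard_cauchy(rng):
    """Draw one standard Cauchy sample by inverse-transform sampling."""
    return math.tan(math.pi * (rng.random() - 0.5))


def cauchy_mutation(position, rng):
    """Perturb each coordinate with a Cauchy-scaled step (assumed update form)."""
    return [x + x * standard_cauchy(rng) for x in position]


rng = random.Random(42)
samples = sorted(standard_cauchy(rng) for _ in range(10001))
mutated = cauchy_mutation([1.0, -2.0, 0.5], rng)
```

Most draws are small, but the heavy tails occasionally produce very large steps, which is what allows the mutation to leave a local optimum.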
- 4.
Tangent flight strategy:
Some studies have introduced Lévy flight operators to improve the performance of swarm intelligence algorithms [
75,
76,
77]. Lévy flight is a random walk process that alternates between high-frequency, short-distance exploration and low-frequency, long-distance exploration, which can simultaneously avoid the local optimum and increase population diversity. Some scholars have recently proposed new algorithms and improvement strategies by imitating the Lévy flight process. For example, Layeb [
78] proposed the tangent search algorithm. Wang et al. [
54] noted that the tangent function is periodic and tends to infinity at some points, and introduced the tangent flight operator to improve the HHO. Following this idea, this study introduces a tangent flight operator when improving the CPO. The step length of the tangent flight can be controlled using Equation (35) [
54]:
where
is a vector composed of d-dimensional random numbers.
The direction is expressed using Equation (36) [
54]:
where
are random numbers obeying uniform distribution between [0, 1], and the value of
can only be
.
After all defense strategies have been performed in each iteration, tangent flight is applied to update the positions again:
where
represents element-wise multiplication.
Based on the above strategy, this study improved the CPO and named it the CMTCPO. The pseudo-code of the CMTCPO algorithm is shown in
Figure S1.
2.4. The Proposed Prediction Model
The prediction model for the sales cycle of second-hand housing proposed in this paper is implemented through the following process:
Step 1: data crawling and preprocessing:
A Python 3.12.7 program built on the Scrapy framework crawled the latest second-hand housing transaction information for Shanghai, Shenzhen, Guangzhou, Chengdu, Qingdao, Nanchang, Luoyang, and Guilin in 2024 from Lianjia.com. Information on their geographical location was obtained from the Gaode map.
Missing values in the crawled data were then handled: to ensure the authenticity of the data, all records with missing values were removed. Additionally, since some indicators are qualitative, they were converted into numerical values so that they could be incorporated into the model.
Machine learning models require high-quality data. Different indicators have varying dimensions and magnitudes, which can lead to decreased prediction accuracy and slower model convergence. Normalization is therefore necessary, and this study uses Min–Max scaling as the normalization method.
Let
denote the normalized value of the data
,
denote the maximum value of the
jth index, and
denote the minimum value of the
jth index. Then, for a larger-is-better index, the normalization is:
Similarly, for a smaller-is-better index, the normalization is:
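Both normalization directions can be sketched in Python as follows. The sketch assumes the usual Min–Max forms, (x − min)/(max − min) for larger-is-better indices and (max − x)/(max − min) for smaller-is-better indices. The function name and the handling of constant columns are hypothetical choices.

```python
def min_max_normalize(column, larger_is_better=True):
    """Scale one index (column) to [0, 1] with Min-Max normalization."""
    lo, hi = min(column), max(column)
    if hi == lo:
        # A constant column carries no information; map it to 0 by convention.
        return [0.0 for _ in column]
    if larger_is_better:
        return [(v - lo) / (hi - lo) for v in column]
    return [(hi - v) / (hi - lo) for v in column]


benefit = min_max_normalize([2, 4, 6])                      # larger is better
cost = min_max_normalize([2, 4, 6], larger_is_better=False)  # smaller is better
```

After scaling, the best value of each index maps to 1 and the worst to 0, regardless of the index's direction.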
Step 2: correlation analysis:
Before selecting a model to quantify the sales cycle, it is essential to determine whether the model should be linear or nonlinear. The Pearson correlation coefficient measures the strength and direction of the linear relationship between two variables. Its formula is as follows [
79]:
where
are two sets of random variables, and
are the sample mean values.
In general, when , the correlation between variables is strong; when , it is moderate; and when , there is no correlation between the variables.
In our study, we calculate the correlation coefficient between each index and the sales cycle to evaluate the strength of the correlation, and then decide whether to use a linear or nonlinear model based on these results. If more than 70% of the indicators demonstrate significant correlation, the linear model is selected for analyzing the sales cycle; otherwise, the nonlinear model is used.
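The decision rule can be sketched in Python as follows. The Pearson coefficient is computed from its standard definition; the cut-off above which a correlation counts as significant is not reproduced in this text, so it is left as a required `threshold` argument. The function names and the example data are hypothetical.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length samples."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return cov / math.sqrt(var_x * var_y)


def choose_model_family(features, target, threshold, share=0.7):
    """Return 'linear' if more than `share` of the indices correlate strongly."""
    strong = sum(
        1 for values in features.values()
        if abs(pearson_r(values, target)) >= threshold
    )
    return "linear" if strong / len(features) > share else "nonlinear"


# Hypothetical data: two indices track the target, one does not.
features = {"a": [1, 2, 3, 4], "b": [2, 4, 6, 8], "c": [1, -1, 1, -1]}
target = [1, 2, 3, 4]
decision = choose_model_family(features, target, threshold=0.6)
```

In this example only two of the three indices exceed the threshold, which is below the 70% share, so the nonlinear model is selected.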
Step 3: finding the optimal parameters of the HKELM based on the CMTCPO:
The partition ratio of the data set directly affects the training of the model. Common partition ratios include 8:2, 7:3, and 6:4, each with its own applicable scenarios. A larger test set allows a more comprehensive evaluation of the model’s ability to generalize to unseen data.
- 2.
Set the model parameters:
The algorithm’s performance depends on selecting appropriate basic parameters, such as the population size
, the number of iterations
, the chaotic control parameter
, the convergence factor
, and the constant
. For general optimization problems, the population size can be selected from
, and the number of iterations is set to 100 to 200 times the problem dimension to ensure the coverage of the search space and to reduce the waste of computing resources. The convergence factor
and the constant
are determined according to Reference [
46], and
is 2.595.
The parameters of the HKELM model include the regularization coefficient , the weight , the bandwidth of the RBF kernel, and the parameters of the Poly kernel. These parameters are initialized according to Equation (12).
- 3.
Calculate the fitness:
After inputting the model parameters and the divided training set, the CMTCPO searches the multidimensional solution space and gradually approaches the global optimal solution through multiple iterations. The root mean square error (RMSE) can quantify the difference between the predicted value and the actual value of the model. The smaller the value is, the higher the prediction accuracy of the model is [
80]. Therefore, we choose the RMSE as the fitness function to evaluate the fitness of each generation, aiming for improved prediction accuracy. The calculation method is outlined in Equation (41) [
80]:
where
is the number of samples,
is the true value of the sample, and
is the predicted value of the sample.
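The fitness evaluation can be sketched in Python as follows, using the standard definition of the RMSE, the square root of the mean squared difference between true and predicted values. The function name and the example values are hypothetical.

```python
import math


def rmse(y_true, y_pred):
    """Root mean square error between true and predicted values."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical sales cycles (in days) and model predictions.
fitness = rmse([30.0, 45.0, 60.0], [32.0, 44.0, 57.0])
```

During optimization, each candidate parameter set would be scored by this value on the training data, and the candidate with the smallest value would be kept as the current optimum.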
- 4.
Update the optimal solution and value according to the fitness of each generation.
- 5.
Check whether the termination condition is satisfied. If so, output the optimal parameters; otherwise, repeat Step 3.
Step 4: establishing the HKELM model and predicting the sales cycle:
The optimal parameters found using the CMTCPO in Step 3 are substituted into the reconstructed HKELM model to predict the sales cycle of second-hand housing.
Step 5: analysis of the accuracy and reliability of the predicted results:
To check for over-fitting, we performed ten-fold cross-validation. The data were randomly divided into 10 groups; in each iteration, 9 groups were used for training and the remaining group for testing. If the RMSE, MAPE, and maximum relative error of the prediction results are very similar across the 10 iterations, the model’s predictions are considered stable and over-fitting is unlikely.
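The ten-fold split can be sketched in Python as follows. This is a minimal sketch of the index bookkeeping only; the function name and the seed are hypothetical, and the model would be trained and evaluated inside the loop.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train, test) index lists for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)          # random assignment to folds
    folds = [indices[i::k] for i in range(k)]     # k nearly equal groups
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


# With 400 samples, each iteration trains on 360 and tests on 40.
splits = list(k_fold_indices(400, k=10))
```

Each sample appears in exactly one test fold, so the error metrics computed in the 10 iterations can be compared directly.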
To assess the reliability of the predicted sales cycle, this study employs the Bland–Altman analysis method. If the differences between the predicted and actual values fall within the 95% limits of agreement, the predictions are deemed reliable.
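The reliability check can be sketched in Python as follows. The sketch assumes the band conventionally used in Bland–Altman analysis, the mean difference ± 1.96 standard deviations of the differences (the 95% limits of agreement). The function name and the example values are hypothetical.

```python
import statistics


def bland_altman(predicted, actual):
    """Return the bias, lower limit, upper limit, and share of points inside."""
    diffs = [p - a for p, a in zip(predicted, actual)]
    bias = statistics.mean(diffs)          # mean difference
    sd = statistics.stdev(diffs)           # sample standard deviation
    lower, upper = bias - 1.96 * sd, bias + 1.96 * sd
    inside = sum(lower <= d <= upper for d in diffs) / len(diffs)
    return bias, lower, upper, inside


# Hypothetical predictions against a constant true value.
bias, lower, upper, inside = bland_altman(
    [10.1, 9.9, 10.2, 9.8], [10.0, 10.0, 10.0, 10.0]
)
```

A bias close to zero and a high share of points inside the limits would indicate good agreement between the predicted and actual sales cycles.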
If the model meets the accuracy and reliability requirements simultaneously, the prediction process is complete. Otherwise, Steps 3 and 4 are repeated and the prediction results are analyzed again until the conditions are met.
The flow chart of this study is shown in
Figure 6:
In summary, our study used the SOR model to identify 33 influencing factors, which are more comprehensive than those considered in previous studies. In addition, the original crested porcupine optimizer is improved and used to optimize the hybrid kernel extreme learning machine. On this basis, a novel prediction model for the second-hand housing sales cycle is proposed.
5. Conclusions
In this paper, a novel prediction model of the second-hand housing sales cycle is proposed based on the improved CPO-HKELM method. Through in-depth analysis and data mining of the second-hand housing market, this model effectively improves the accuracy and reliability of cycle prediction. The experimental results using 400 groups of data from eight cities in China show that the maximum relative error of the improved CPO-HKELM model is 0.0001784, the MAPE is 0.00001235%, and the RMSE is 0.0002050. Three strategies of chaotic mapping, hybrid mutation, and tangent flight are used to improve the CPO. Specifically, chaotic mapping is added to the population initialization stage, which increases the population diversity. The hybrid mutation is added to the first, third, and fourth stages of the original CPO to avoid the algorithm falling into a local optimum, and a tangent flight strategy is added at the end to further broaden the search space of the algorithm. Compared with the classical CPO, GA, PSO, BO, SSA, WOA, and GWO, the improved CPO has the smallest calculation error and the fastest convergence speed. Compared with the BPNN, LSSVM, RF, XGBoost, and LightGBM, the HKELM has the lowest RMSE and the shortest computing time, handling high-dimensional complex data sets more effectively and significantly reducing the consumption of computing resources. In addition, with the help of the SOR model, we discuss many important factors that affect the second-hand housing sales cycle, including market supply and demand, economic indicators, policy changes, and so on. The above results show that the prediction model based on the improved CPO-HKELM can provide reliable theoretical support and a data basis for the research and practice of the second-hand housing market.
Future research can further explore the influence of multiple variables on model performance in order to improve forecasting ability and promote the in-depth development of real estate market analysis.