2.1. Chi-Square Independence Test
The chi-square (CHI2) test examines whether there is a relationship between two categorical variables in a single sample, assessing their independence or association [22]. The null hypothesis ($H_0$) states that there is no association between the two variables, implying they are independent. Conversely, the alternative hypothesis ($H_1$) posits that there is a significant association between them, indicating they are not independent. Consider a contingency table (Table 1) that displays the frequency counts of the joint occurrences of the two categorical variables, and let $X$ represent one variable with categories $x_1, x_2, \ldots, x_r$ and $Y$ represent the other variable with categories $y_1, y_2, \ldots, y_c$. The values $O_{i\cdot}$ and $O_{\cdot j}$ are the marginal totals of the rows and columns, respectively, and are calculated as $O_{i\cdot} = \sum_{j=1}^{c} O_{ij}$ and $O_{\cdot j} = \sum_{i=1}^{r} O_{ij}$, where $O_{ij}$ is the observed frequency in cell $(i, j)$. Under the null hypothesis of independence, the expected frequency ($E_{ij}$) for each cell in the contingency table is

$$E_{ij} = \frac{O_{i\cdot}\, O_{\cdot j}}{N}, \qquad (1)$$

where $N = \sum_{i=1}^{r} \sum_{j=1}^{c} O_{ij}$ is the total sample size.
The chi-square statistic ($\chi^2$) measures the discrepancy between the observed frequencies ($O_{ij}$) and the expected frequencies ($E_{ij}$) given by (1), and is defined as

$$\chi^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}.$$

Under certain conditions, such as when the sample size is sufficiently large and the expected frequencies are not too small, the statistic follows a chi-square distribution with $(r-1)(c-1)$ degrees of freedom. Typically, each expected frequency should be at least 5 to ensure the validity of the chi-square approximation.
To determine whether to reject the null hypothesis, we use the $\chi^2$ statistic. The null hypothesis is rejected if the statistic exceeds the critical value from the chi-square distribution at a chosen significance level (such as $\alpha = 0.05$). Alternatively, the null hypothesis is rejected if the p-value is less than the significance level.
It is important to note that the chi-square test is an omnibus test. Therefore, if the test indicates a significant association, post hoc procedures need to be conducted to compare individual conditions.
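As a brief illustration, the test can be carried out on a contingency table with a standard statistical library. The sketch below uses SciPy's chi2_contingency on a small, made-up 2 x 3 table of consumption counts; the table values are purely illustrative and are not drawn from the study's data.

import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical contingency table: rows = two household groups,
# columns = low / medium / high consumption counts (illustrative values only).
observed = np.array([[30, 45, 25],
                     [20, 35, 45]])

# chi2_contingency returns the chi-square statistic, the p-value,
# the degrees of freedom (r-1)(c-1), and the table of expected frequencies E_ij.
chi2_stat, p_value, dof, expected = chi2_contingency(observed)

alpha = 0.05
print(f"chi2 = {chi2_stat:.3f}, dof = {dof}, p-value = {p_value:.4f}")
if p_value < alpha:
    print("Reject H0: the two variables appear to be associated.")
else:
    print("Fail to reject H0: no evidence of association.")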
2.2. Recursive Feature Elimination with Multinomial Logistic Regression
The RFE process aims to select the most important features by iteratively removing the least significant ones, thereby improving the efficiency of the feature-selection process. As explained in Algorithm 1, the process begins with constructing a model using all available features. Subsequently, each feature is assigned a weight based on its relevance in classifying the target variable. The feature with the lowest weight is then eliminated. The model is subsequently reconstructed, and the importance of each remaining feature is recalculated [
23]. This iterative process continues until the remaining set of features meets a predefined performance threshold.
Algorithm 1 Recursive Feature Elimination (RFE)
Input: Dataset X with feature dimension k
Output: S ← set of features providing the highest performance metric
1: Initialize: S ← {f_1, f_2, ..., f_k}
2: Train a model on X using features in S
3: min_score ← performance metric
4: for each feature f in S do
5:     S' ← S \ {f}
6:     Train a model on X using features in S'
7:     if performance metric < min_score then
8:         f* ← f
9:         min_score ← performance metric
10:    end if
11: end for
12: Remove f* from S
Given that the response variable in this study, household energy usage, is categorized into three distinct classes (low-, medium- and high-load-consumption profiles), we employed a multinomial logistic regression (MLR) algorithm. This generalized linear model is particularly suitable for situations where the response variable encompasses more than two categories. The MLR algorithm utilizes a non-linear log transformation, which facilitates the calculation of the probability of occurrence for each class of the dependent variable [
24].
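For reference, scikit-learn provides an RFE wrapper that follows the same elimination loop of Algorithm 1 around an estimator whose coefficients supply the feature weights. The sketch below is a minimal illustration only: the synthetic data, the variable names, and the number of retained features are placeholder assumptions, not the study's actual configuration.

from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the household-consumption data:
# 500 samples, 12 candidate features, 3 load-profile classes.
X, y = make_classification(n_samples=500, n_features=12, n_informative=6,
                           n_classes=3, n_clusters_per_class=1, random_state=0)

# Logistic regression handles the three classes via the multinomial (softmax)
# formulation with the default lbfgs solver; its coefficients provide the
# per-feature weights used for elimination.
estimator = LogisticRegression(max_iter=1000)

# Drop the lowest-weighted feature one at a time until 6 features remain.
selector = RFE(estimator, n_features_to_select=6, step=1)
selector.fit(X, y)

print("Selected feature mask:", selector.support_)
print("Feature ranking (1 = kept):", selector.ranking_)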
Several assumptions were verified to ensure the applicability of the multinomial logistic regression:
Independence of irrelevant alternatives: The assumption that the odds of preferring one class over another are independent of the presence or absence of other alternatives. This was tested using the Hausman-McFadden test [
25].
No multicollinearity: Multicollinearity among predictors was checked using variance inflation factor (VIF) values, ensuring all were below the threshold of 5 [26] (see the VIF sketch after this list).
Linearity of logits: The relationship between continuous predictors and the logit of the outcome was confirmed to be linear [
27].
Large sample size: The sample size was sufficiently large to provide reliable estimates of the model parameters [
28].
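As a brief illustration of the multicollinearity check referenced above, VIF values can be computed with statsmodels. The snippet is a sketch: the predictor names and values are hypothetical, not the study's survey variables.

import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant

# Hypothetical predictor matrix; in practice this would hold the study's features.
rng = np.random.default_rng(0)
predictors = pd.DataFrame({
    "occupants": rng.integers(1, 6, 200),
    "floor_area": rng.normal(120, 30, 200),
    "appliance_count": rng.integers(5, 25, 200),
})

# Add an intercept column so each VIF is computed against a model with a constant.
X = add_constant(predictors)

# VIF for each predictor (skip the constant in column 0); values below 5
# are taken here as evidence of acceptable multicollinearity.
vif = pd.Series(
    [variance_inflation_factor(X.values, i) for i in range(1, X.shape[1])],
    index=predictors.columns,
)
print(vif)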
The probability that an observation $i$ belongs to a particular class $c$, given the predictor variables $\mathbf{x}_i$, is denoted as $P(y_i = c \mid \mathbf{x}_i)$ and is given by

$$P(y_i = c \mid \mathbf{x}_i) = \frac{\exp\left(\beta_{0c} + \boldsymbol{\beta}_c^{\top} \mathbf{x}_i\right)}{\sum_{c'=1}^{C} \exp\left(\beta_{0c'} + \boldsymbol{\beta}_{c'}^{\top} \mathbf{x}_i\right)},$$

where $\beta_{0c}$ denotes the intercept for class $c$, and $\boldsymbol{\beta}_c$ represents the vector of regression coefficients for class $c$.
To assess the performance of the classifier, a confusion matrix is employed to compare the predicted classes to the actual classes from the ground truth data. The accuracy of the model is then calculated by dividing the number of correct predictions by the total number of predictions, thus providing a metric to evaluate the model’s performance.
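To make the class-probability expression and the evaluation step concrete, the sketch below applies the softmax form above to a coefficient matrix and scores the resulting predictions with a confusion matrix and accuracy. The coefficients, test points, and labels are small made-up arrays used purely for illustration.

import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

def softmax_probabilities(X, intercepts, coefs):
    """P(y = c | x) for each class c via the multinomial-logit (softmax) form."""
    scores = intercepts + X @ coefs.T            # linear predictor per class
    scores -= scores.max(axis=1, keepdims=True)  # numerical stability
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum(axis=1, keepdims=True)

# Made-up "fitted" parameters for 3 classes and 2 predictors (illustrative only).
intercepts = np.array([0.2, -0.1, -0.1])
coefs = np.array([[0.5, -0.3],
                  [-0.2, 0.4],
                  [-0.3, -0.1]])

X_test = np.array([[1.0, 0.5], [-0.8, 1.2], [0.3, -1.5], [2.0, 0.1]])
y_true = np.array([0, 1, 2, 0])

probs = softmax_probabilities(X_test, intercepts, coefs)
y_pred = probs.argmax(axis=1)   # predict the most probable class

print("Confusion matrix:\n", confusion_matrix(y_true, y_pred))
print("Accuracy:", accuracy_score(y_true, y_pred))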
2.4. Fuzzy Rough Feature-Selection Method
Zadeh’s fuzzy set theory [
32] extends classical set theory to handle uncertainty and vagueness. In classical set theory, an element either belongs to a set or does not. However, fuzzy set theory allows for partial membership, where elements can belong to a set with varying degrees between 0 and 1. This approach accommodates uncertainty about the boundaries of sets.
Let $V$ be the universe of discourse, which is the set of all possible elements under consideration in a given context. A fuzzy set $A$ is defined as a set of ordered pairs $A = \{(v, \mu_A(v)) : v \in V\}$, where the membership function $\mu_A(v) \in [0, 1]$ represents the degree of membership of element $v$ in the fuzzy set $A$.
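As a small illustration (not taken from the paper), a fuzzy set such as "high daily consumption" over a universe of kWh readings can be encoded by a membership function that rises gradually instead of switching abruptly from 0 to 1; the thresholds below are invented.

import numpy as np

def mu_high_consumption(kwh):
    """Illustrative membership in the fuzzy set 'high daily consumption':
    0 below 5 kWh, 1 above 15 kWh, and a linear ramp in between."""
    return np.clip((kwh - 5.0) / 10.0, 0.0, 1.0)

V = np.array([2.0, 6.0, 10.0, 14.0, 20.0])    # universe of discourse (kWh readings)
print(dict(zip(V, mu_high_consumption(V))))   # e.g., 10 kWh -> membership 0.5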
Rough set theory aims to handle incomplete or imprecise information by distinguishing between certain and uncertain knowledge [33]. Let $U$ be the universe of objects, which includes all specific objects under analysis in the given context. While $V$ represents all possible elements considered, $U$ focuses on the particular set of objects being studied. Let $R \subseteq U \times U$ be an equivalence relation representing the lack of knowledge about elements of $U$. For a set of objects $X \subseteq U$, $[x]_R$ denotes the equivalence class of $R$ determined by element $x$. The rough set $X$ is composed of the tuple $(\underline{R}X, \overline{R}X)$, where $\underline{R}X$ is the lower approximation and $\overline{R}X$ is the upper approximation, defined as

$$\underline{R}X = \{x \in U : [x]_R \subseteq X\}, \qquad \overline{R}X = \{x \in U : [x]_R \cap X \neq \emptyset\}.$$

In the rough set, $\underline{R}X$ represents the certain elements of $X$, while $\overline{R}X$ includes both certain and uncertain elements. The difference between the two approximations separates the region of complete information (positive region) from the region of uncertainty (boundary region): $BND_R(X) = \overline{R}X \setminus \underline{R}X$.
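A minimal sketch of these approximations, assuming the indiscernibility relation is already given as a list of equivalence classes (the objects and classes below are made up for illustration):

def lower_approximation(blocks, X):
    """Union of equivalence classes fully contained in X (certain members)."""
    return {x for block in blocks if block <= X for x in block}

def upper_approximation(blocks, X):
    """Union of equivalence classes that intersect X (possible members)."""
    return {x for block in blocks if block & X for x in block}

# Hypothetical universe {1,...,6} partitioned by an indiscernibility relation R.
classes = [{1, 2}, {3}, {4, 5, 6}]
X = {1, 2, 3, 4}                            # target set of objects

lower = lower_approximation(classes, X)     # {1, 2, 3}
upper = upper_approximation(classes, X)     # {1, 2, 3, 4, 5, 6}
boundary = upper - lower                    # {4, 5, 6}: the uncertain region
print(lower, upper, boundary)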
Fuzzy rough sets [34] combine fuzzy set theory and rough set theory to handle uncertainty and imprecision in data. Let $V$ be the universe of discourse, and $A$ be a fuzzy set in $V$. Let $U$ be a non-empty universe, and $R$ be a similarity relation on $U$. For the fuzzy set $A$, the fuzzy rough set is the pair $(\underline{R}A, \overline{R}A)$ of fuzzy sets on $U$, such that for every $x \in U$:

$$\mu_{\underline{R}A}(x) = \inf_{y \in U} \max\{1 - R(x, y),\ \mu_A(y)\}, \qquad \mu_{\overline{R}A}(x) = \sup_{y \in U} \min\{R(x, y),\ \mu_A(y)\}.$$
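A compact numerical sketch of these two membership functions, using a small made-up similarity matrix R and fuzzy memberships mu_A (values are illustrative only):

import numpy as np

def fuzzy_lower(R, mu_A):
    """mu_{R lower A}(x) = inf_y max(1 - R(x, y), mu_A(y)) for each object x."""
    return np.min(np.maximum(1.0 - R, mu_A[np.newaxis, :]), axis=1)

def fuzzy_upper(R, mu_A):
    """mu_{R upper A}(x) = sup_y min(R(x, y), mu_A(y)) for each object x."""
    return np.max(np.minimum(R, mu_A[np.newaxis, :]), axis=1)

# Hypothetical similarity relation over 3 objects and their memberships in A.
R = np.array([[1.0, 0.7, 0.2],
              [0.7, 1.0, 0.4],
              [0.2, 0.4, 1.0]])
mu_A = np.array([0.9, 0.6, 0.1])

print("lower:", fuzzy_lower(R, mu_A))   # certain membership of each object
print("upper:", fuzzy_upper(R, mu_A))   # possible membership of each object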
The fuzzy rough sets feature-selection (FRFS) algorithm identifies relevant features that distinguish between different classes or concepts in a dataset. It uses rough set approximations to evaluate feature contributions and iteratively refines feature selection to balance relevance and redundancy in the selected feature subset. The algorithm considers the dataset’s inherent properties to ensure optimal feature selection.
The algorithm selects features that increase the positive region size until it matches the size of the positive region computed with all features, or until the required number of features is reached. The concept of indiscernibility is central to locating a positive region. Suppose we have a non-empty set of objects $U$ and a non-empty set of attributes $A$. The indiscernibility of two objects $x_i$ and $x_j$ based on a set of attributes $F$, where $F \subseteq A$, is given by

$$IND(F) = \{(x_i, x_j) \in U \times U : \forall a \in F,\ a(x_i) = a(x_j)\}.$$

Using indiscernibility, we can identify the partition of $U$ generated by $IND(F)$. This partition is defined as

$$U / IND(F) = \otimes\{\, U / IND(\{a\}) : a \in F \,\},$$

where $\otimes$ for two sets $A$ and $B$ is represented by

$$A \otimes B = \{\, X \cap Y : X \in A,\ Y \in B,\ X \cap Y \neq \emptyset \,\}.$$

Let $Q$ be an equivalence relation over $U$. The positive region $POS_F(Q)$ can be found using

$$POS_F(Q) = \bigcup_{X \in U/Q} \underline{F}X,$$

where $\underline{F}X$ are the rough-set lower approximations, defined as

$$\underline{F}X = \{\, x \in U : [x]_F \subseteq X \,\}.$$

If the goal is to identify the largest positive region for a specific number of features, denoted as $k$, the FRFS algorithm calculates $POS_F(Q)$ for each set of $k$ features and selects the set that maximizes it, i.e., the set whose positive region contains the largest number of objects. Alternatively, if all possible feature-subset sizes are considered, FRFS returns the smallest number of features that maximizes $|POS_F(Q)|$.
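The sketch below illustrates the crisp versions of these quantities on a tiny, invented decision table: it builds the partition U/IND(F) for a feature subset, takes the lower approximations of the decision classes, and measures the resulting positive region. The attribute names, values, and subset size are placeholders.

from itertools import combinations

# Hypothetical decision table: each column is an attribute, y is the decision class.
data = {
    "a1": [0, 0, 1, 1, 1, 0],
    "a2": [1, 1, 0, 0, 1, 0],
    "a3": [0, 1, 0, 1, 1, 1],
}
y = [0, 0, 1, 1, 0, 1]
U = range(len(y))

def partition(features):
    """U / IND(F): group objects that agree on every attribute in `features`."""
    blocks = {}
    for x in U:
        key = tuple(data[f][x] for f in features)
        blocks.setdefault(key, set()).add(x)
    return list(blocks.values())

def positive_region(features):
    """Union of lower approximations of each decision class w.r.t. IND(F)."""
    decision_classes = [{x for x in U if y[x] == c} for c in set(y)]
    pos = set()
    for block in partition(features):
        if any(block <= cls for cls in decision_classes):
            pos |= block
    return pos

# Evaluate every 2-feature subset and keep the one with the largest positive region.
best = max(combinations(data, 2), key=lambda fs: len(positive_region(fs)))
print("best 2-feature subset:", best,
      "positive region size:", len(positive_region(best)))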
The fuzzy rough nearest neighbor (FRNN) classification method can handle fuzzy and uncertain data. It extends the nearest-neighbor classification concept by classifying test objects based on their similarity to a specified number of neighbors, denoted by $k$. The method considers the membership degrees of these neighbors to the class labels when assigning a class label to the test object [35].
The FRNN algorithm computes the distances $d(y, x_i)$ between an unclassified object $y$ from the testing data and each object $x_i$ from the training data. After calculating these distances, FRNN considers the fuzzy memberships of the $k$ nearest neighbors to determine the fuzzy membership of $y$ to each class $c$. The aggregation process combines the fuzzy memberships of the neighbors to determine the fuzzy membership of the test object. The membership degree of a given new sample $y$ in a class $c$, represented by the $k$ nearest neighbors, is measured as

$$\mu_c(y) = \frac{\sum_{i=1}^{k} \mu_c(x_i)\, d(y, x_i)^{-2/(m-1)}}{\sum_{i=1}^{k} d(y, x_i)^{-2/(m-1)}},$$

where $m$ is the fuzzy strength parameter and $\mu_c(x_i)$ is the membership of the sample $x_i$ from the training data to the class $c$ among the $k$ nearest neighbors.
To classify the test data point $y$, we select the class $c$ with the highest fuzzy membership value $\mu_c(y)$. Subsequently, we compare the predicted class labels (associated with fuzzy memberships) to the true class labels (i.e., the original class of the test data point). The accuracy is calculated by determining the correctness of predictions for each data point $y$.
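A minimal sketch of this neighbor-weighted membership computation, with made-up training points, crisp class memberships, and m = 2 (so the weights reduce to inverse squared distances); it illustrates the formula above rather than the study's implementation.

import numpy as np

def frnn_memberships(X_train, memberships, y_point, k=3, m=2.0, eps=1e-12):
    """Fuzzy class memberships of a test point from its k nearest training neighbors."""
    dists = np.linalg.norm(X_train - y_point, axis=1)
    nearest = np.argsort(dists)[:k]                       # indices of k nearest neighbors
    weights = 1.0 / (dists[nearest] ** (2.0 / (m - 1.0)) + eps)
    # Distance-weighted average of the neighbors' class memberships.
    num = (memberships[nearest] * weights[:, np.newaxis]).sum(axis=0)
    return num / weights.sum()

# Hypothetical 2-feature training data with crisp memberships to 3 classes.
X_train = np.array([[0.1, 0.2], [0.2, 0.1], [0.9, 0.8], [0.8, 0.9], [0.5, 0.5]])
memberships = np.array([[1, 0, 0], [1, 0, 0], [0, 1, 0], [0, 1, 0], [0, 0, 1]], dtype=float)

mu = frnn_memberships(X_train, memberships, y_point=np.array([0.15, 0.15]), k=3)
print("class memberships:", mu, "-> predicted class:", mu.argmax())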
The algorithm for the FRFS is presented in Algorithm 3.
Algorithm 3 Fuzzy Rough Feature Selection (FRFS)
Input: Dataset D with features F = {f_1, f_2, ..., f_k}, target variable Y
Output: Selected feature subset S
1: Initialize: S ← ∅
2: Calculate the initial positive region POS_F(Y) for all features F
3: while POS_S(Y) ≠ POS_F(Y) do
4:     for each feature f ∈ F \ S do
5:         Compute the positive region POS_{S ∪ {f}}(Y) for feature f
6:     end for
7:     Find the feature f* that maximizes the positive region
8:     Add f* to S
9:     Update the positive region POS_S(Y)
10:    if POS_S(Y) equals POS_F(Y) then
11:        break
12:    end if
13: end while
14: return S
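Finally, a compact sketch of the greedy loop in Algorithm 3, reusing the toy decision table from the earlier positive-region sketch and, for readability, the crisp positive-region measure (a full fuzzy rough implementation would replace positive_region with the fuzzy lower-approximation-based measure); all names and data are illustrative.

# Greedy forward selection in the spirit of Algorithm 3 (crisp positive regions).
data = {
    "a1": [0, 0, 1, 1, 1, 0],
    "a2": [1, 1, 0, 0, 1, 0],
    "a3": [0, 1, 0, 1, 1, 1],
}
y = [0, 0, 1, 1, 0, 1]
U = range(len(y))

def positive_region(features):
    """Objects whose IND(features)-block falls entirely inside one decision class."""
    blocks = {}
    for x in U:
        blocks.setdefault(tuple(data[f][x] for f in features), set()).add(x)
    classes = [{x for x in U if y[x] == c} for c in set(y)]
    return {x for block in blocks.values()
            if any(block <= cls for cls in classes) for x in block}

full_pos = positive_region(list(data))   # positive region using all features
selected = []
while set(selected) != set(data):
    # Add the feature whose inclusion yields the largest positive region.
    candidate = max((f for f in data if f not in selected),
                    key=lambda f: len(positive_region(selected + [f])))
    selected.append(candidate)
    if positive_region(selected) == full_pos:
        break

print("selected subset:", selected)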