In this paper, a smart join method based on an optimization process is proposed. The aim of this optimization problem is to select the method that minimizes the errors of the resampling process for each feature.
3.1. Description of the Methodology
The general concept of the methodology of the smart join is explained next:
First, the joining model is fitted using training data; in other words, the optimal joining solution of the process is obtained. This needs to be done for each feature separately.
Then, resampled data is predicted by applying the selected join method to the test data.
Finally, the model is validated using the resampling error.
Suppose we have a time series slice y of the selected feature that needs to be resampled so that it can be joined on a desired time index. First, the fit method is used to obtain the “optimal” join method. The inputs needed for the join are the original time series slice (y with the original time index) and the desired time index. An optional parameter is a fill-NA function, since it can affect the selection of the “optimal” method. Then, another slice of the same feature (z) is used for testing by means of the score method. Finally, the optimal join method is used to resample other time slices of the feature with the predict method. The structure of the different methods is depicted in Figure 1.
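The fit, score and predict steps described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation used in the paper: the class name SmartJoin, the three helper join functions, the use of plain numbers as time stamps, and the stand-in error (the share of NA values) are all hypothetical, and the optional fill-NA function is omitted.

```python
from bisect import bisect_left, bisect_right


def backward(times, values, t):
    """Value of the last observation at or before t (None if there is none)."""
    i = bisect_right(times, t)
    return values[i - 1] if i > 0 else None


def forward(times, values, t):
    """Value of the first observation at or after t (None if there is none)."""
    i = bisect_left(times, t)
    return values[i] if i < len(times) else None


def nearest(times, values, t):
    """Value of the observation closest in time to t (ties go to the earlier one)."""
    i = bisect_left(times, t)
    candidates = [k for k in (i - 1, i) if 0 <= k < len(times)]
    return values[min(candidates, key=lambda k: abs(times[k] - t))]


def na_share(times, values, target, resampled):
    """Stand-in error (assumption): fraction of resampled points that are NA."""
    return sum(v is None for v in resampled) / len(resampled)


class SmartJoin:
    """Hypothetical fit / predict / score interface for one feature."""

    def __init__(self, methods, error):
        self.methods = methods  # dict: method name -> join function
        self.error = error      # callable returning a float, lower is better
        self.best_ = None       # name of the selected join method

    def _resample(self, name, times, values, target):
        join = self.methods[name]
        return [join(times, values, t) for t in target]

    def fit(self, times, values, target):
        # Evaluate every candidate join method and keep the one with least error.
        errors = {
            name: self.error(times, values, target,
                             self._resample(name, times, values, target))
            for name in self.methods
        }
        self.best_ = min(errors, key=errors.get)
        return self

    def predict(self, times, values, target):
        # Resample another slice with the join method selected by fit.
        return self._resample(self.best_, times, values, target)

    def score(self, times, values, target):
        # Resampling error of the selected method on a validation slice.
        return self.error(times, values, target,
                          self.predict(times, values, target))


methods = {"nearest": nearest, "backward": backward, "forward": forward}
model = SmartJoin(methods, na_share)

# Training slice y and validation slice z (made-up numbers, times in minutes).
model.fit([0, 3, 4, 9], [1.0, 2.0, 2.5, 4.0], target=[0, 2, 4, 6, 8, 10])
z_resampled = model.predict([1, 5, 8], [0.5, 0.7, 0.9], target=[2, 4, 6, 8])
z_error = model.score([1, 5, 8], [0.5, 0.7, 0.9], target=[2, 4, 6, 8])
```

With this stand-in error, several methods can tie; the sketch then keeps the first one in the dictionary, which is a design choice of the example only.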
The fitting process to find an optimal joining model can be represented mathematically as follows. Suppose we have the time series y, defined on an initial temporal reference system. Let j be a join method from the set of available methods J. We need to obtain a new time series with the desired temporal reference system. The smart join algorithm aims to find the optimal join method, that is, the one that minimizes an error function. The parameter for applying the smart join method is the function f used to fill unavailable measurement values; if it is not specified, a default value is used. The possible values of the imputation function f are: not filling NA values, using the nearest subsequent value, and using the nearest prior value. The optimization problem is defined as:
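In assumed notation, a problem of this kind can be sketched as shown next. The symbols T' (desired temporal reference system), \hat{y}_j (series resampled with join method j and then filled with f), j^{*} (selected method) and E (error function) are introduced here for illustration only and may differ from the notation of the original equation.

```latex
j^{*} = \operatorname*{arg\,min}_{j \in J} \; E\bigl(y, \hat{y}_{j}\bigr),
\qquad
\hat{y}_{j} = f\bigl(j(y, T')\bigr)
```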
With respect to the second contribution put forward by the present paper, the proposed error function is defined by Equation (2), in which the individual error terms are combined using weights for the total error calculation. If the weights are not specified, their default values are used.
In the following paragraphs, each function that takes part in the error is presented, using separate indices for the elements of the resampled time series and of y.
One function represents the percentage of NA elements in the resampled time series after the application of f. NA values can be problematic in machine learning applications: for example, data points with NA values may have to be removed during model training, or the trained model may be unable to predict an output value.
Another function is the percentage of elements of y that are not used in the resampled time series. This value is related to the loss of information from the original time series caused by the resampling.
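As a hedged illustration of these two quantities, the sketch below computes them for a backward join on made-up time stamps. The function names are hypothetical, the values are returned as fractions rather than percentages, and no fill-NA function is applied.

```python
from bisect import bisect_right


def backward_join(src_times, target_times):
    """For each target time, index of the last source point at or before it.

    Returns None where no such source point exists (an NA in the result).
    """
    matches = []
    for t in target_times:
        i = bisect_right(src_times, t)
        matches.append(i - 1 if i > 0 else None)
    return matches


def na_share(matches):
    """Fraction of resampled elements that are NA."""
    return sum(m is None for m in matches) / len(matches)


def unused_share(matches, n_source):
    """Fraction of original elements that never appear in the resampled series."""
    used = {m for m in matches if m is not None}
    return (n_source - len(used)) / n_source


src_times = [1, 2, 3, 7, 8]      # original time index (made-up, in minutes)
target_times = [0, 4, 8]         # desired time index (made-up)
matches = backward_join(src_times, target_times)

na = na_share(matches)                            # one of three targets is NA
unused = unused_share(matches, len(src_times))    # three of five points unused
```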
A further function indicates the percentage of delayed elements. If most of the data points from y are delayed, the reality presented to the machine learning model is displaced. Depending on the application environment, making decisions supported by a machine learning system that does not adequately represent the current situation can be problematic.
A related function is the maximum difference in time between a delayed element used in the resampled time series and its original time position, normalized by the time window of y. Whereas the previous function considers the frequency of delayed elements, this one takes into account the magnitude of the displacement.
Two equivalent functions are defined for anticipated elements: the percentage of anticipated elements and the maximum normalized displacement of an anticipated element. On the one hand, the use of anticipated data is equivalent to the use of future information for prediction, so the results can be misleading, and the approximation used should be sound enough to deal with value forecasting. On the other hand, using future data could imply waiting for the arrival of a new observation before a prediction can be made, or a correction would be needed once the predicted value and the real one are compared.
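The delay and anticipation terms can be illustrated with the sketch below. It is an approximation under assumptions: the function name is hypothetical, an element is counted as delayed when its original time stamp is earlier than the target time it is assigned to (and as anticipated when it is later), the shares are taken relative to the number of resampled points, and the time window of y is taken as the span between its first and last time stamps.

```python
def displacement_stats(src_times, target_times, matches):
    """Shares and maximum normalized displacements of delayed/anticipated points.

    matches[k] is the index in src_times of the point used at target_times[k],
    or None when the resampled value is NA.
    """
    window = src_times[-1] - src_times[0]   # time window of the original series
    delays, leads = [], []
    for t, m in zip(target_times, matches):
        if m is None:
            continue
        s = src_times[m]
        if s < t:
            delays.append(t - s)   # old value shown later: delayed
        elif s > t:
            leads.append(s - t)    # future value shown earlier: anticipated
    n = len(target_times)
    return {
        "delayed_share": len(delays) / n,
        "max_delay": max(delays, default=0) / window,
        "anticipated_share": len(leads) / n,
        "max_anticipation": max(leads, default=0) / window,
    }


# Made-up example: result of a nearest join of src_times onto target_times.
src_times = [0, 3, 4, 9]
target_times = [0, 2, 4, 6, 8]
matches = [0, 1, 2, 2, 3]
stats = displacement_stats(src_times, target_times, matches)
```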
Finally, the last function calculates the difference between the two time series (original and resampled). This value represents the magnitude of the distortion introduced by the need for joined data with a synchronized temporal reference system. The values that are compared are obtained by means of linear interpolation of the time series y and of the resampled time series, respectively, over a common set of time values.
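A minimal sketch of such a comparison is given below. Several choices are assumptions made for the example and are not taken from the paper: the common set of time values is the union of both time indexes, the difference is aggregated as a mean absolute difference, it is normalized by the value range of y, values outside a series are held constant at its end points, and the resampled series contains no NA values.

```python
from bisect import bisect_right


def interp(t, times, values):
    """Piecewise-linear interpolation, constant outside the observed range."""
    if t <= times[0]:
        return values[0]
    if t >= times[-1]:
        return values[-1]
    i = bisect_right(times, t)          # times[i - 1] <= t < times[i]
    t0, t1 = times[i - 1], times[i]
    v0, v1 = values[i - 1], values[i]
    return v0 + (v1 - v0) * (t - t0) / (t1 - t0)


def interpolation_gap(src_times, src_values, new_times, new_values):
    """Mean absolute difference of the two interpolated series, range-normalized."""
    grid = sorted(set(src_times) | set(new_times))
    diffs = [
        abs(interp(t, src_times, src_values) - interp(t, new_times, new_values))
        for t in grid
    ]
    spread = max(src_values) - min(src_values)
    return sum(diffs) / len(diffs) / spread if spread else 0.0


# Made-up example: the resampled series lags behind the original one.
gap = interpolation_gap([0, 2, 4], [0.0, 2.0, 4.0], [0, 4], [0.0, 2.0])
same = interpolation_gap([0, 2, 4], [0.0, 2.0, 4.0], [0, 4], [0.0, 4.0])
```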
Each part of the sum in the error calculation, Equations (3)–(9), is normalized to guarantee that its result lies in a fixed range, so that the different errors are comparable with one another.
The fitting method is shown graphically in Figure 2.
Validating the join method on different time slices of the time series is crucial. If the slice of data used to train the joining model is adequately selected, the errors should be similar in different time windows. Depending on the stability of the feature, retraining may be required, as the optimal join method may not remain the most adequate over the whole time period. Furthermore, selecting the desired temporal reference system is equally important, as it should be the same for all the features in order to be able to construct a database with all the features used by the model. Although the error calculation and the optimal joining method are determined separately for each feature, the desired temporal reference system is a common input to all the optimization problems, and its selection affects all the features.
3.2. Application Example
This subsection introduces an illustrative example of the application of the proposed method to a dataset obtained from a simple piecewise function. Suppose that the piecewise function is sampled irregularly, in order to save memory, according to two criteria:
The system checks every minute if the value of the data point has changed enough according to a pre-established criterion (in this particular case, a difference with the prior data point higher than 0.5) to save that data point.
Every minute, the system also checks the difference in time with the last saved data point and, if this difference is greater than or equal to four minutes, it saves the last available data point.
The original piecewise function and the saved data using these criteria are shown in Figure 3.
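The two saving criteria can be simulated with a short sketch. The piecewise signal below is a hypothetical stand-in and not the function shown in Figure 3; the sketch also assumes that the change is measured against the last saved data point and that "the last available data point" is the current reading.

```python
def piecewise(t):
    """Hypothetical piecewise signal (not the function used in the paper)."""
    if t < 5:
        return 0.0
    if t < 10:
        return 0.3 * (t - 5)
    return 3.0


def irregular_sample(signal, minutes, min_change=0.5, max_gap=4):
    """Check the signal once per minute and save a point when either rule fires."""
    saved = []
    for t in range(minutes):
        value = signal(t)
        if not saved:
            saved.append((t, value))    # always keep the first reading
            continue
        last_t, last_value = saved[-1]
        changed_enough = abs(value - last_value) > min_change   # first criterion
        too_long_ago = t - last_t >= max_gap                    # second criterion
        if changed_enough or too_long_ago:
            saved.append((t, value))
    return saved


saved = irregular_sample(piecewise, minutes=16)
saved_times = [t for t, _ in saved]
```

In this made-up run the flat segments are stored only every four minutes, while the ramp and the jump trigger the change criterion, which gives the irregular time index that the join methods then have to handle.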
Suppose that the desired time reference system has been fixed. Results after the application of different joining methods are shown in Figure 4. Error values used in the optimization of the Smart Join methodology are shown in Table 6.
Because the input for the algorithm is the received data, when the default weights of the error function are used, the minimal error is obtained by the nearest join (see Figure 4b). However, if knowledge about the irregular sampling approach used by the system is introduced by penalizing the anticipation of data points (for example, through the weights of the anticipation terms), the optimal join method is the backward join. Figure 4d shows that the data points obtained by the backward join, as a result of taking into account this extended description of the data sampling mechanism, are the ones closest to the real piecewise function.
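The effect of the weights on the selected method can be reproduced qualitatively with a reduced sketch. It does not implement Equation (2) and does not use the values of Table 6: the error here has only two terms (mean normalized delay and mean normalized anticipation), and the time stamps, function names and weight values are made up for illustration.

```python
from bisect import bisect_left, bisect_right


def backward_match(src, t):
    """Time stamp of the last source point at or before t (None if none)."""
    i = bisect_right(src, t)
    return src[i - 1] if i > 0 else None


def nearest_match(src, t):
    """Time stamp of the source point closest to t (ties go to the earlier one)."""
    i = bisect_left(src, t)
    candidates = [src[k] for k in (i - 1, i) if 0 <= k < len(src)]
    return min(candidates, key=lambda s: abs(s - t))


def simplified_error(src, target, match, w_delay=1.0, w_anticipation=1.0):
    """Weighted sum of mean delay and mean anticipation, normalized by the window."""
    window = src[-1] - src[0]
    delay = anticipation = 0.0
    for t in target:
        s = match(src, t)
        if s is None:
            continue                 # NA values are ignored in this reduced error
        if s < t:
            delay += t - s
        elif s > t:
            anticipation += s - t
    return (w_delay * delay + w_anticipation * anticipation) / (len(target) * window)


def select(src, target, **weights):
    """Return the name of the join method with the smallest simplified error."""
    methods = {"nearest": nearest_match, "backward": backward_match}
    errors = {name: simplified_error(src, target, m, **weights)
              for name, m in methods.items()}
    return min(errors, key=errors.get), errors


src = [0, 4, 7, 9, 10, 14]       # irregular time index (made-up, in minutes)
target = [0, 3, 6, 9, 13]        # desired time index (made-up)

default_choice, _ = select(src, target)
penalized_choice, _ = select(src, target, w_anticipation=5.0)
```

With equal weights the nearest join wins because it minimizes the total displacement; once anticipation is weighted more heavily, the backward join, which never uses future points, becomes the minimizer.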
Having established the significance of the measure of quality of a joining method, in the remainder of this contribution we leverage mathematical optimization techniques on training data to automatically determine which of the joining methods is most adequate for a given time series.