Article

Toward Dynamic Data-Driven Time-Slicing LSH for Joinable Table Discovery

College of Engineering Training Centre, China Jiliang University, Hangzhou 310018, China
* Author to whom correspondence should be addressed.
Electronics 2024, 13(19), 3920; https://doi.org/10.3390/electronics13193920
Submission received: 5 September 2024 / Revised: 26 September 2024 / Accepted: 2 October 2024 / Published: 3 October 2024

Abstract:
In legacy industrial systems, discovering joinable information between database tables is important for applications such as data integration and data analysis. Locality-Sensitive Hashing-based methods have been proven capable of handling chaotic and diverse table relationships, but these methods often rely on an incorrect assumption, namely that the similarity of table columns in the database directly reflects their joinability, which undermines the accuracy of their results. To solve this problem, this study proposes a dynamic data-driven time-slicing Locality-Sensitive Hashing method for joinable table discovery. This method introduces database log information and, within different time slices, uses the co-occurrence matrix of data tables to determine their joinability. Specifically, it first performs MinHash dimensionality reduction on database columns and then uses Locality-Sensitive Hashing to calculate the static similarity. Next, it identifies business-modular time slices through database logs, calculates the dynamic similarity of the sliced data, and builds a co-occurrence matrix between tables. Finally, the joinability between data tables is calculated using the static similarity, dynamic similarity, and co-occurrence matrix. The experimental results demonstrate that this method effectively excludes, during data cleaning, tables that are merely similar but have no business relationship, and its accuracy exceeds that of methods that depend on similarity alone.

1. Introduction

The construction of factories has become the main driving force of growth in the digital economy, opening a new era of data-driven production and enterprise transformation. As a key factor of production, the value of data is becoming increasingly apparent in this era. To fully exploit this value, we need to organize and clean the information contained in the databases of various industrial systems, especially in the manufacturing sector, where processes and technology are complex. The relationships between data tables are correspondingly complex, the number of tables far exceeds that in ordinary systems, and dozens of table fields must be joined to query and sort the required results. In the field of data element analysis, joining two or more database tables is a fundamental and necessary task. Table joining is not only widely used in database management and business intelligence but is also commonly applied in machine learning [1,2]. Finding tables that can be joined is essentially an inverse analysis of the current database design, that is, a type of data schema analysis. At present, data schema analysis approaches can be divided into two types: schema-level and instance-level methods. Schema-level methods focus on the structure and design of the database; however, for small- and medium-sized manufacturing enterprises, design information for many legacy system databases is often difficult to obtain. Instance-level methods instead focus on the actual data in databases. Existing instance-level association rule mining algorithms include Apriori [3] and FP-Growth [4], among others. Most research has focused on mining association rules; however, in the cleaning of data elements, more attention is paid to whether database columns can be joined to obtain detailed information about a particular business.
Existing column relationship schema-matching techniques mainly use simple statistics- and string-based comparisons, such as prefix/suffix detection, edit distance, value range, minimum/maximum similarity, and mutual information measurements based on the q-gram distribution [5,6,7,8]. Set similarity search is another version of this problem [9], which has been widely used in various applications such as document searches, keyword querying, entity recognition, and data cleaning [10]. As examples of relevant research, Zhu et al. [11] proposed the Intersection Estimation algorithm to find joinable tables; Zhang et al. [12] clustered data columns to discover joinable tables; Esmailoghli et al. [13] aggregated row values into hash values to construct a super key, facilitating the search for joinable tables across multiple columns; Dong et al. [14] used a pre-trained model to convert text attributes into high-dimensional vectors, thereby semantically connecting records through similarity predicates to obtain more accurate results; Fan et al. [15] proposed the Starmie framework, which uses a pre-training strategy to capture contextual semantic information in a table and uses the cosine similarity between column vectors as the join degree score; Khatiwada et al. [16] proposed SANTOS, which adds binary relations to tables to improve the accuracy of joinable table retrieval using a knowledge base approach; and Taha et al. [17] outlined various popular methods for finding joinable tables in data lakes and their associated performance.
Regrettably, in the existing literature, similarity and joinability are often regarded as the same concept. Similar columns can be joined, and joined columns carry a certain degree of similarity, but similarity does not necessarily denote business relevance; similarity tends to be analyzed from a data perspective, whereas joinability must be analyzed from a business perspective. Therefore, joinability inferred from similarity alone can be meaningless. For instance, while the key attributes of a sales order table and an attendance information table may meet certain similarity requirements, they will never be joined during data cleaning. In this study, we found that the tables representing a user's business activities are modular and correlated within a given time period. Therefore, we propose a dynamic data-driven time-slicing Locality-Sensitive Hashing (LSH) method (DTL) for joinable table discovery. This method considers both static and dynamic similarity, combined with a co-occurrence matrix (i.e., the number of times that tables co-appear in a time segment), to determine joinability. When calculating the static similarity, all the data in the database are used: MinHash [18] is applied to reduce the dimensionality, and LSH [19] is used for the calculation. Dynamic similarity is obtained by analyzing database logs and constructing time slices with business modularity; the similarity between tables is then calculated within these modularized data, and a co-occurrence matrix is constructed on this basis. This method can exclude tables that have high similarity but are irrelevant to the business. The proposed method was tested on a real enterprise data set, and the experimental results demonstrate that its accuracy in finding joinable tables surpasses that of methods that rely solely on similarity.
The main contributions of this study include the following:
(1)
It investigates and defines a method for representing joinability through combining static and dynamic similarity.
(2)
It innovatively uses changes in database logs to study the joinability information between tables and reveals how to use database operation logs for dynamic similarity analysis.
(3)
It proposes a data-partitioning method based on time slices. By stacking different time slices, we can obtain the co-occurrence matrix of each table regarding specific operations. Using this co-occurrence matrix as a parameter in combination with the static and dynamic similarity, we can extract collections of joined tables.

2. Methodology

2.1. Preliminaries

In this subsection, the basic methods adopted in this study are introduced, including MinHash, LSH, and the use of database logs. Table 1 summarizes the notation frequently used in this study.
In this study, attributes and columns are considered equivalent, representing the collection of data in a table. Suppose we have a given table $A$. The $i$th column of $A$ is labeled $A_i$, and the value in the $i$th row and $j$th column of $A$ is denoted by $a_{ij}$. Then, table $A$ can be denoted as
$$A = [A_1, A_2, \ldots, A_n] = \begin{bmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ a_{21} & a_{22} & \cdots & a_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ a_{m1} & a_{m2} & \cdots & a_{mn} \end{bmatrix}$$
The similarity between the attribute domains $A_i$ and $B_j$ is denoted as $\rho$, which is calculated as follows:
$$\rho = d(A_i, B_j) = \frac{|A_i \cap B_j|}{|A_i \cup B_j|}$$
Given a similarity calculation method and a threshold $\tau$, the attributes $A_i$ and $B_j$ are considered a match if (and only if) $\rho \geq \tau$; otherwise, they are considered a mismatch. The matching degree $\rho$ measures the similarity between the two columns of data.
Similarity is generally considered as the basis of joinability. In a simple method for calculating column similarity, for the data in attribute $Q$, the similarity distance with respect to all elements in $R$ is calculated in turn. When the database contains $n$ columns of attributes, the time complexity of this method is $O(n^2)$. However, when the number of attributes reaches a certain order of magnitude (e.g., over 10,000) and the number of rows in each column set exceeds 1 million, the time consumption of this brute-force calculation becomes prohibitive.
In the literature [18], a method for calculating the Jaccard similarity between sets of any size was proposed. In this technique, a set of attributes is converted into a MinHash signature using a set of Minwise hash functions. Suppose $h_{\min}$ is one such hash function; then, the minimum hash values for domains $A_i$ and $B_j$ are $h_{\min}(A_i)$ and $h_{\min}(B_j)$. The probability $\rho$ of the two minimum hash values being equal is considered as the similarity between $A_i$ and $B_j$ [18]:
$$\rho = p(h_{\min}(A_i) = h_{\min}(B_j)) = d(A_i, B_j)$$
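As a concrete illustration, the following minimal sketch estimates the Jaccard similarity of two columns from their MinHash signatures using the datasketch library (listed among the dependencies in Section 3.1); the column values here are invented for the example and are not taken from the paper's data set.

```python
from datasketch import MinHash

def minhash_signature(values, num_perm=128):
    """Build a MinHash signature for one column's value set."""
    m = MinHash(num_perm=num_perm)
    for v in values:
        m.update(str(v).encode("utf8"))
    return m

# Two illustrative columns sharing three of their five distinct values.
col_a = ["SO-1001", "SO-1002", "SO-1003", "SO-1004"]
col_b = ["SO-1002", "SO-1003", "SO-1004", "SO-1005"]

sig_a, sig_b = minhash_signature(col_a), minhash_signature(col_b)
print(sig_a.jaccard(sig_b))  # estimate of d(A_i, B_j); exact value is 3/5
```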
LSH is a solution for the nearest neighbor search problem in high-dimensional space [19], which can also be applied to approximate R-nearest neighbor searches [20]. In LSH, if two objects are similar, their distances after projection using the distance-sensitive hash function will be very close; meanwhile, if they are dissimilar, their distances will be relatively far. Lv et al. [21] proposed the Multi-Probe LSH method, whose core concept is to check all hash buckets that may contain neighbors of the query point according to a specific probe sequence; that is, if an object is close to the query object but has not been projected into the same hash bucket, it is likely to have been projected into a nearby bucket. This method effectively reduces memory and time overheads by using fewer hash tables and expanding the query range, while still accurately obtaining the candidate set.
When a program runs normally, the database continuously generates logs, recording all operations in detail. Prior studies [22,23,24] have surveyed the development trends and research directions in the field of log analysis. For example, system administrators can obtain a better understanding of the health and operating status of a program [25] by regularly checking the log data. When a user performs a business operation on a system page, it may trigger changes in the data of multiple tables. For example, adding a new sales order involves inter-related tables such as the order information table, the detail table, the inventory table, and the shipping table. If the operation is successful, the data changes in these tables are sequentially recorded in the database log; therefore, once a table changes, its joinable tables are likely to appear in the surrounding log context.
As the log records are closely related to the business information, combining them with the similarity of the existing data allows us to more effectively determine the joinability of tables.

2.2. Dynamic Data-Driven Time-Slicing LSH Method

This subsection describes a novel method for calculating the joinability of tables based on dynamic data-driven time-slicing LSH (denoted as DTL). This method processes the table data in a database using MinHash and LSH, obtaining the static similarity; then, it analyzes the database logs, constructs a time slice with modularity (with respect to the business information) to calculate the dynamic similarity, and forms a co-occurrence matrix. The co-occurrence matrix is then combined with the static and dynamic similarity data to obtain effective joinability information. The framework of the proposed method is depicted in Figure 1.
Figure 1 indicates the relevant modules of this study. Initially, LSH hash tables are generated by traversing the column data to establish the static similarity. Next, the database logs are analyzed to extract time slices, and the dynamic similarity information between fields and the co-occurrence matrix between tables are constructed. Finally, using the LSH hash tables and the dynamic similarity information, attribute elements that satisfy the similarity conditions are extracted and added to the candidate set, and the joinability between the query element and each element of the candidate set is computed.

2.2.1. Constructing the Static Similarity

1. Defining the Minwise Function
Several hash functions $h_i$ are defined, each of which can randomly permute the order of the elements. A hash function can be defined as
$$h_i(x) = (a_i \cdot x + b_i) \bmod p$$
where $a_i$ and $b_i$ are randomly chosen constants, $p$ is a large prime number, and $x$ is an element of the attribute data set. For each column $Q$ in the database, the MinHash signature is the vector of minimum values calculated by multiple hash functions. Using $k$ hash functions, the signature can be expressed as
$$\mathrm{MinHash}(Q) = [\min(h_1(Q)), \min(h_2(Q)), \ldots, \min(h_k(Q))]$$
where $\min(h_i(Q))$ represents the minimum value obtained over the column through the hash function $h_i$; thus, the high-dimensional attribute set can be reduced to $k$ dimensions.
2. Defining the LSH Function
In this method, $L$ hash functions $h_1, h_2, \ldots, h_L$ are chosen to form a hash function group $G(x) = (h_1(x), h_2(x), \ldots, h_L(x))$. For each hash function group $G_i(x)$, a hash table is constructed; this process is then repeated $S$ times to obtain $S$ hash tables.
3. Calculating LSH Tables
Due to the large number of tables, fields, and data in the database, loading them all into memory would put pressure on the server and the database. To solve this problem, the method dynamically generates SQL to read column data based on the database configuration information (including table name, field name, field type, and field length), reducing the load on the server and database. After obtaining the necessary information, we can use the above functions to calculate the MinHash signatures and create the LSH hash tables. The similarity set of the object being queried can then be obtained through the LSH hash tables, and the resulting similarity value is the static similarity. A combined code sketch of these three steps is given below.
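The sketch below is one way to realize the three steps: an affine Minwise family, $k$-dimensional signatures, and $S$ banded hash tables. All constants, key names, and column values are illustrative; the banding layout is a standard LSH construction and is an assumption, as the paper does not spell out its exact table structure. In the paper's setting, the column values would be streamed via dynamically generated SQL rather than held in a dict.

```python
import random
import zlib
from collections import defaultdict

P = 4294967311  # an illustrative prime larger than any 32-bit hash value

def make_minwise_functions(k, seed=42):
    """Draw k pairs (a_i, b_i) defining h_i(x) = (a_i * x + b_i) mod P."""
    rng = random.Random(seed)
    return [(rng.randrange(1, P), rng.randrange(0, P)) for _ in range(k)]

def minhash_signature(values, funcs):
    """MinHash(Q) = [min h_1(Q), ..., min h_k(Q)] over a column's values."""
    xs = [zlib.crc32(str(v).encode("utf8")) for v in values]
    return [min((a * x + b) % P for x in xs) for (a, b) in funcs]

def build_lsh_tables(signatures, s_tables, l_rows):
    """Split each k-dim signature into S bands of L values; each band is
    hashed into its own table, yielding the S hash tables described above."""
    tables = [defaultdict(set) for _ in range(s_tables)]
    for key, sig in signatures.items():
        for i in range(s_tables):
            band = tuple(sig[i * l_rows:(i + 1) * l_rows])
            tables[i][band].add(key)
    return tables

def query(tables, sig, s_tables, l_rows):
    """Candidate columns sharing at least one band with the query signature."""
    out = set()
    for i in range(s_tables):
        band = tuple(sig[i * l_rows:(i + 1) * l_rows])
        out |= tables[i].get(band, set())
    return out

# Hypothetical column data keyed as "table.column".
columns = {
    "sales_order.order_no": ["SO-1", "SO-2", "SO-3", "SO-4"],
    "order_detail.order_no": ["SO-1", "SO-2", "SO-3", "SO-9"],
    "meeting.room_id": ["R-1", "R-2", "R-3", "R-4"],
}
funcs = make_minwise_functions(k=64)
sigs = {key: minhash_signature(vals, funcs) for key, vals in columns.items()}
tables = build_lsh_tables(sigs, s_tables=16, l_rows=4)
print(query(tables, sigs["sales_order.order_no"], s_tables=16, l_rows=4))
```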

2.2.2. Dynamic Data Acquisition

In small- and medium-sized manufacturing industries, database management systems such as MSSQL, Oracle, MySQL, PostgreSQL, and MongoDB are widely used, all of which provide features for logging changes in data. To improve the efficiency of capturing changes in database logs, the Change Data Capture (CDC) strategy [26,27,28] can be employed. The core idea of CDC is to track and capture changes in the database, including insertions, updates, and deletions of data or tables, and then record these changes completely, in the order in which they occurred. To support subsequent analysis, the captured log data are processed into a specific format and stored in a high-throughput, durable Kafka [29] message queue, making it convenient for applications to consume and analyze the data at different time intervals, as shown in Figure 2.
In Kafka, the log information is organized by row: each record corresponds to one data change in a table, and its contents include the changed table name, the field name, the field value, and the type of operation. Records are read in time slices as the calculation unit, where a time slice is $t$ seconds. If $R_t$ represents the log set within time $t$, then $R_t = \{R_{t1}, R_{t2}, \ldots, R_{tm}\}$, where $R_{ti}$ represents the $i$th log record in time slice $t$. Each $R_{ti}$ can be represented by the four-tuple $(o_{ti}, r_{ti}, f_{ti}, v_{ti})$, where $o_{ti}$ represents the operation type (addition, modification, or deletion); $r_{ti}$ represents the table name; $f_{ti}$ is the field information in the table; and $v_{ti}$ is the list of field values, expressed as $v = \{old, new\}$. For the addition and deletion operations, the value of $old$ is empty; for modification operations, $old$ is the value before the modification and $new$ is the value after the update.
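To make the four-tuple concrete, the sketch below parses one change event into $(o, r, f, v)$. The JSON envelope is purely an assumption for illustration; real CDC tools (e.g., Debezium) use their own formats, and these field names are not the authors' actual schema.

```python
import json

def parse_log_record(raw: str):
    """Parse one CDC event into the four-tuple (o, r, f, v) described above.
    The JSON field names here are hypothetical."""
    e = json.loads(raw)
    o = e["op"]                                      # 'add' | 'edit' | 'delete'
    r = e["table"]                                   # table name r_ti
    f = e["field"]                                   # field name f_ti
    v = {"old": e.get("old"), "new": e.get("new")}   # 'old' is empty for add/delete
    return (o, r, f, v)

raw = '{"op": "edit", "table": "inventory", "field": "qty", "old": "7", "new": "5"}'
print(parse_log_record(raw))
```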
Let $Field = \{(r_{ti}, f_{ti}) \mid (o_{ti}, r_{ti}, f_{ti}, v_{ti}) \in R_t\}$ denote the set of combinations of table names and field names within time slice $t$. $Data(Field)$ represents the data set of all fields, and $Data(Field_{rf})$ represents the data set of field $f$ in table $r$. The initial condition is $Data(Field) = \emptyset$.
For the addition operation, the data in the corresponding field are added to the data set. For the deletion operation, the data were joinable to some columns before being deleted; therefore, the deleted data are likewise added:
$$Data(Field_{rf}) = Data(Field_{rf}) \cup \{\, v_{ti} \mid (o_{ti}, r_{ti}, f_{ti}, v_{ti}) \in R_t \wedge r = r_{ti} \wedge f = f_{ti} \,\}$$
For the editing operation, we record the data before and after the edit. Regarding joinability, if the data were joinable before, then the data after the edit are also joinable. Therefore, we add both the data before and after the edit to the collection, as follows:
$$Data(Field_{rf}) = Data(Field_{rf}) \cup \{\, v_{ti}(old) \cup v_{ti}(new) \mid (o_{ti}, r_{ti}, f_{ti}, v_{ti}) \in R_t \wedge r = r_{ti} \wedge f = f_{ti} \,\}$$
In this way, we can obtain all dynamic attribute data within time slice t .
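The two update rules collapse into one accumulation loop. A minimal sketch under the four-tuple format assumed above: additions and deletions contribute their single value, and edits contribute both the old and the new value.

```python
from collections import defaultdict

def accumulate_slice(records):
    """Build Data(Field_rf) for one time slice from four-tuples (o, r, f, v)."""
    data = defaultdict(set)  # (table, field) -> set of observed values
    for o, r, f, v in records:
        # Additions/deletions carry one value; edits carry old and new,
        # matching the two set-union rules above.
        data[(r, f)].update(x for x in (v["old"], v["new"]) if x is not None)
    return data

records = [
    ("add",  "orders", "order_no", {"old": None, "new": "SO-7"}),
    ("edit", "stock",  "qty",      {"old": "7",  "new": "5"}),
]
print(accumulate_slice(records)[("stock", "qty")])  # {'7', '5'}
```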

2.2.3. Time Slice and Dynamic Similarity

Choosing an appropriate time slice is key to ensuring the accuracy of the obtained dynamic similarity values. First, the optimal splitting point must be found, which is the maximum time interval between two log records in the Kafka message queue after the end of the last read. However, this interval needs to be within a certain pre-defined threshold range of the last read time. Therefore, the minimum time length $I_{\min}$ and the maximum time length $I_{\max}$ of the read slice are taken as constraints for timestamp filtering, in order to prevent the time slice from being too large or too small. Let $K$ be the timestamp of the last log read, $threshold$ be the minimum acceptable value of the time interval, and $T$ be the sequence of log timestamps; then, $T$ can be expressed as $T = [t_1, t_2, \ldots, t_n]$, where $t_i$ is the timestamp of the $i$th log. First, we filter out the timestamps, denoted as $T^{*}$, that come after $K$ and whose offsets from $K$ lie between $I_{\min}$ and $I_{\max}$, as follows:
$$T^{*} = [\, t_i \mid t_i > K \wedge t_i - K \geq I_{\min} \wedge t_i - K \leq I_{\max} \,]$$
Then, calculate the time interval between adjacent records in the data set $T^{*}$ and obtain the maximum value:
$$\Delta T = [\, \Delta t_i = t_{i+1} - t_i \mid t_i, t_{i+1} \in T^{*} \,]$$
$$\Delta t_{\max} = \max(\Delta T)$$
$$\Delta t = \begin{cases} \Delta t_{\max}, & \Delta t_{\max} \geq threshold \\ I_{\max}, & \Delta t_{\max} < threshold \end{cases}$$
To determine the optimal module, the $threshold$ parameter must be set appropriately, as it has a significant impact on the correlation results. In the extreme case where $\Delta t_{\max}$ is very small because the current user is operating continuously, the log times cannot separate the business modules, so $I_{\max}$ is used directly as the slice time. Conversely, if the $threshold$ value is set too large, the slices may miss the module boundaries, affecting the quality of the time slicing and the accuracy of the results.
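A sketch of the split-point selection under one reading of the rules above: within $[K + I_{\min}, K + I_{\max}]$, the slice is cut at the start of the widest inter-record gap, and it falls back to $I_{\max}$ when that gap is below $threshold$. The cut placement and all names are illustrative interpretations, not the authors' exact implementation.

```python
def split_point(timestamps, k, i_min, i_max, threshold):
    """Return the timestamp ending the next time slice.
    timestamps: sorted log timestamps in seconds; k: end of the last read."""
    # T*: timestamps after K whose offset from K lies in [I_min, I_max]
    t_star = [t for t in timestamps if t > k and i_min <= t - k <= i_max]
    if len(t_star) < 2:
        return k + i_max  # not enough points to measure gaps
    gaps = [(t_star[i + 1] - t_star[i], t_star[i]) for i in range(len(t_star) - 1)]
    dt_max, cut = max(gaps)
    # If even the widest pause is shorter than the threshold, the user is
    # operating continuously and I_max is used as the slice length instead.
    return cut if dt_max >= threshold else k + i_max

print(split_point([3, 25, 26, 27, 61, 62], k=0, i_min=20, i_max=300, threshold=10))  # 27
```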
Through this method, we obtain the optimal split point and time slice and, finally, the dynamic data information $R_{\Delta t} = \{R_{t1}, R_{t2}, \ldots, R_{tm}\}$ within this time slice. Using the method described above, we can then calculate the dynamic data set $Data(Field)$, compute the MinHash signatures, build an LSH hash table, and use the Multi-Probe LSH method for similarity retrieval. A four-tuple $(field_i, field_j, s, count)$ is defined for the similarity between two candidate columns, where $field_i$ and $field_j$ are the column names, $s$ is the similarity value between columns $i$ and $j$, and $count$ is the cumulative number of time slices, that is, the number of times the tables containing the two columns co-occur, which is initially 0. Each time the pair occurs, $count$ is incremented by 1; thus, for the $L$th time slice, the tuple can be expressed as $(field_i, field_j, s_L, 1)$. Across the existing time slices, the similarities $s$ between the same columns are folded together as follows:
$$s = \frac{s \cdot count + s_L}{count + 1}$$
$$count = count + 1$$
Using this method, after processing multiple time slices, we can obtain attribute matrix information that represents the dynamic similarity values.
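The two update equations amount to a running average over slices. A minimal sketch, keeping the tuple layout defined above (the names and example values are illustrative):

```python
def fold_similarity(entry, s_l):
    """Fold the similarity from the latest time slice into the running
    four-tuple (field_i, field_j, s, count), per the update equations above."""
    field_i, field_j, s, count = entry
    s = (s * count + s_l) / (count + 1)
    return (field_i, field_j, s, count + 1)

entry = ("orders.order_no", "detail.order_no", 0.80, 3)
print(fold_similarity(entry, s_l=0.60))  # (..., 0.75, 4)
```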

2.2.4. Joinability Construction

When the data in a certain table change, joined tables often change synchronously in the context of the log. This study uses a co-occurrence matrix to represent the number of times that tables appear together in a single time slice. If two tables frequently appear together in a time slice, their co-occurrence value, and hence the probability of their joinability, is larger. Assume that there are two data tables, $A$ and $B$, which appear at the same time in time slice $T_i$; as the number of time slices increases, the number of times that they appear together can be expressed as
$$C_{AB}(T_i) = C_{AB}(T_{i-1}) + 1$$
The co-occurrence value $C_{AB}$ is initially 0 and is increased by 1 whenever the two tables appear in the same time slice. The larger $C_{AB}$ is, the more likely the tables are to belong to the same business module, and the stronger the join. If two columns can be joined, they must have a certain similarity; thus, the attribute $A_i$ is queried through the LSH hash table and the dynamic similarity information to obtain the candidate set that meets the similarity conditions. Then, each element $B_j$ is selected from this set in turn, yielding the static similarity $d(A_i, B_j)$ and the dynamic similarity $s(A_i, B_j)$ between the attributes. In data cleaning, tables that can be joined belong to a business module and also exhibit similarity in their data. The joinable degree of attributes $A_i$ and $B_j$ can therefore be expressed as:
$$l(A_i, B_j) = \sigma \cdot d(A_i, B_j) + (1 - \sigma) \cdot s(A_i, B_j)$$
$$\sigma = \frac{C_{AB}}{i}$$
where $\sigma$ represents the probability of tables $A$ and $B$ appearing together across all $i$ time slices. As an example, when a user generates a sales order, the operation will not be related to the meeting information table; however, this does not rule out similarity between the data in the order module and the data in the meeting record table. Because the two tables never appear together, $\sigma = 0$ and the dynamic similarity $s(A_i, B_j)$ is 0; thus, $l(A_i, B_j)$ is also 0. Although the columns are similar, they are not related. Through this mechanism, we can exclude tables with no business relationship. If tables $A$ and $B$ are business-relevant, then once the number of time slices is sufficiently large, $\sigma \neq 0$; at this point, $s(A_i, B_j)$ tends toward $d(A_i, B_j)$, and their similarity can represent their joinability. The joinability calculation process is shown in Figure 3.
The joinability between tables $A$ and $B$ is the maximum joinability value over the two tables' column pairs:
$$l(A, B) = \max_{i, j} l(A_i, B_j)$$
In this way, the elements in the candidate set are processed to obtain the set of joinable tables.
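Putting the pieces together, the following sketch computes the joinability for one table pair. The dict-based inputs and all names are illustrative; in the paper's pipeline the static scores would come from the LSH tables and the dynamic scores from the accumulated time slices.

```python
def joinability(static, dynamic, c_ab, n_slices):
    """l(A, B) = max over column pairs of sigma*d + (1 - sigma)*s,
    with sigma = C_AB / (number of time slices)."""
    sigma = c_ab / n_slices if n_slices else 0.0
    pairs = set(static) | set(dynamic)
    return max(
        (sigma * static.get(p, 0.0) + (1.0 - sigma) * dynamic.get(p, 0.0)
         for p in pairs),
        default=0.0,
    )

l_ab = joinability(
    static={("order_no", "order_no"): 0.90},   # d(A_i, B_j) from LSH tables
    dynamic={("order_no", "order_no"): 0.85},  # s(A_i, B_j) from time slices
    c_ab=42, n_slices=50,
)
print(l_ab)  # 0.84 * 0.90 + 0.16 * 0.85 = 0.892
```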

3. Experiments

3.1. Data Sets

Publicly available data sets are representative of the similarity processing field; however, this research required complete database log information. Thus, to more closely approximate the actual situation, we chose the most commonly used and most complex enterprise resource planning (ERP) database in the telecom cable industry, which has been used in a production environment for many years. It contains 5574 tables and 95,370 columns, with a total size of about 3 GB; the column type details are shown in Figure 4.
Figure 4 shows that most fields use the char, float, varchar, and datetime types. During the cleaning process, the joinability between data tables is mainly determined according to primary keys or business codes; we therefore directly filtered out non-joinable fields such as datetime, binary, and short char fields, leaving 6081 fields that satisfied the joinability requirements. Consequently, 713 tables containing records remained, with an average of 68,000 rows per table. We thus only needed to calculate the joinability of the filtered table fields; the common types of joinable fields are shown in Table 2.
The experiment was carried out on a computer equipped with an AMD 7840HS CPU (8 cores) and 64 GB of DDR5 memory. The operating system was Windows 11 Professional Edition, and the method was implemented in Python 3.8, using libraries such as numpy, pandas, and datasketch for data processing and analysis.

3.2. Evaluation Metrics

This study focused on improving the accuracy of joinable table discovery by eliminating tables that only show similarity but are irrelevant to the business during the data cleaning process; thus, the widely used precision and recall metrics were adopted as the measurement standards. The ground truth was determined during manual data cleaning. For a query column $Q$, $TP$ denotes the set of tables tagged as having business joins with $Q$, and $FN$ denotes the set of joinable tables computed by the present method. Precision represents the proportion of correctly identified joinable tables out of all predicted joinable tables, as follows:
$$p = \frac{|TP \cap FN|}{|FN|}$$
Recall represents the proportion of correctly identified joinable tables out of the total set of annotated joinable tables and is defined as follows:
$$r = \frac{|TP \cap FN|}{|TP|}$$
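In set form, these metrics are directly computable. A small sketch keeping the paper's (non-standard) $TP$/$FN$ naming; table identifiers are illustrative:

```python
def precision_recall(tp: set, fn: set):
    """tp: manually tagged business-joinable tables for query Q;
    fn: joinable tables returned by the method under evaluation."""
    hit = tp & fn
    p = len(hit) / len(fn) if fn else 0.0
    r = len(hit) / len(tp) if tp else 0.0
    return p, r

print(precision_recall({"t1", "t2", "t3"}, {"t2", "t3", "t4"}))  # (0.666..., 0.666...)
```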

3.3. Performance Comparison

To highlight the effectiveness of the proposed approach, we compared it with the previously proposed techniques LSH Ensemble [30], Aurum [31], Josie [11], and plain LSH. LSH Ensemble (L32, with the number of shards set to 32) uses a transformation that allows MinHash LSH to be employed with set containment similarity; the algorithm then reads and computes the exact set intersection size for candidates to obtain the final ranked results. Aurum utilizes schema- and instance-level information to identify different relationships between the attributes of a data lake; through a two-step process of profiling and indexing the data, it establishes a graph structure for keyword searches and/or the determination of joinability. Josie is a powerful joinable table discovery framework that uses an intersection-estimation-based search to evaluate the joinability of tables and returns the top-k joinable tables as results. As the original algorithm only supports a top-k search, we modified it to use the intersection-estimation value as the condition for ending the search, with this value set equal to $\tau$. Finally, to better illustrate the perspective of this article, plain LSH was also selected as a comparison method.
For the experiment, $I_{\min}$ was set to 20 s, $I_{\max}$ to 5 min, and $threshold$ to 10 s. A total of 16 tables were selected, for which we obtained the average precision of the compared methods under different values of the similarity threshold $\tau$, as shown in Figure 5.
Figure 5 shows that, when applied to data obtained from small- and medium-sized manufacturing industries, DTL was superior to Aurum, L32, and Josie in terms of joinability accuracy, with an average advantage of about 7% over the Josie framework. L32, Aurum, and DTL are all based on the LSH index type. When the LSH method is used alone, its accuracy is very low because it can only represent static similarity, indicating that column similarity does not intuitively reflect the joinability between tables. The Aurum method achieves a relatively better result than LSH alone because the relationships between the extracted data sets are used as a knowledge graph and combined with database schema information. Although L32 divides the data set into different sections based on the distribution of column values, it does not incorporate time and cannot capture the shared contextual information of tables within a specific timestamp, so its effectiveness is limited. Josie, as a powerful framework for discovering related tables, is insensitive to data distribution, but its effectiveness still lags behind that of DTL. These results demonstrate the feasibility and advantages of combining static and dynamic similarity for finding joinable tables in small- and medium-sized manufacturing enterprises; the use of log data to construct business modules and thereby remove similar but non-business-related tables is the reason why the DTL method works well in this setting. Under the same conditions, the recall rates are shown in Figure 6.
Figure 6 indicates that the recall rate of DTL was somewhat lower than those of Aurum, L32, and Josie; this is because the comparison methods rely solely on similarity analysis of the data dimensions, which yields higher recall than our method. In particular, the LSH method has a recall rate of over 90% because it treats similarity entirely as joinability and thus covers many tables that have no business relationship, which is also why its precision is very low. Additionally, because this study combines dynamic and static similarity in a modular manner, tables that are seldom manipulated do not appear in the log; when $\sigma = 0$, their degree of joinability is evaluated as zero, implying no business joinability, and such tables are filtered out.
This experiment showed that, although the recall of the proposed method was slightly lower than the values for the compared methods, the table set $FN$ covers the truly business-related tables while filtering out irrelevant ones, so $TP$ tends to be similar to $FN$; therefore, the higher precision of the proposed DTL method is the more meaningful result.
The time complexity of the processing phase includes the creation of the hash tables and the analysis of the logs, whereas the time complexity of the query phase remains the same as that of Multi-Probe LSH.

3.4. Investigation of Parameter Sensitivity

This subsection describes the effects of different parameter values on the experimental results.

3.4.1. Threshold Experiment

The $threshold$ value directly affects the size of the time slice. If it is too large, the situation $\Delta t_{\max} < threshold$ will occur, and the time slice will be adjusted to $I_{\max}$. Experiments were conducted with different $threshold$ values under otherwise identical parameters, and the results are shown in Figure 7.
The best results were achieved when $threshold$ was in the range of 3 to 10 s. Considering the complexity of page operations in the industrial sector, the impact of $threshold$ is relatively limited when the value is small; a user needs to switch pages or click buttons after completing one operation before triggering another, which takes a certain amount of time. However, when the time interval exceeded 15 s, the detection performance decreased significantly. This decrease is due to an inability to effectively divide the time slices, which filters out reasonable business modules, thus distorting the frequencies in the co-occurrence matrix and leading to decreased accuracy.

3.4.2. $I_{\max}$ Experiment

$I_{\max}$ can directly affect the structure of the business modules. We used minutes as the unit of measurement and analyzed the impact of this parameter on the results, as shown in Figure 8.
Figure 8 indicates that when $I_{\max}$ exceeded 1 min, its impact on the results was not significant. This is partly because, when acquiring time slices, several slices meeting the required interval are usually found within $I_{\max}$; only a few slices default to the full $I_{\max}$ length because no appropriate interval is found. When $I_{\max}$ was set to more than 10 min, a minor impact on the experimental results was observed, as all the data tables within the entire $I_{\max}$ window tend to be treated as one module; at this point, any columns that are similar will be judged as joinable.

3.4.3. Time Slice Statistics Experiment

Next, the proportion of time slices of different lengths was calculated based on the obtained slice information, as shown in Figure 9.
The area marked by the two green lines represents the time slices from 8 to 44 s, which occupy 82.7% of all time slices. The most common slice length was concentrated around 22 s, representing the time it takes a user to perform a certain business operation. For operations that are logically complex, involve multiple tables, or carry a large amount of data, more time is needed for business logic processing and database operations, which can increase the length of the time slices.

3.4.4. Time Slice Quantity Experiment

Based on the characteristics of user behavior, 22 s was selected as the average length of a time slice. A statistical unit was set to contain 1000 time slices; based on an analysis of user activity, one such unit generally covers about one day of logs. Nearly 120 statistical units were used for the analysis, as shown in Figure 10.
Figure 10 shows that the joinability results depended significantly on the number of time slices. Specifically, when the number of time slices was less than 30 units, it was difficult to obtain evaluation results, and the accuracy only improved significantly once the number of time slices reached a certain level. Upon reaching 80 units, the influence of the number of time slices on the results stabilized, as this period basically covers the common operations of users.

4. Conclusions

This paper addresses the problem of discovering joinable tables in the context of data cleaning in the manufacturing and related industries. A time-slicing LSH (DTL) method based on database logs was proposed, designed to improve the accuracy of joinable table discovery. The proposed method is particularly effective in small- and medium-sized manufacturing enterprises, where it accurately identifies joinable tables and reduces the time spent finding them, thereby providing a more efficient strategy for data cleaning.
The improvement in accuracy achieved using this method is primarily due to the integration of dynamic and static similarity measures. Traditional approaches typically rely solely on static similarity, which overlooks the temporal dynamics of data. In contrast, our method captures the business modularity between data tables by extracting effective time slices from database logs. This approach represents the relationships between tables using a co-occurrence matrix, while also calculating dynamic similarity within each slice. By incorporating the co-occurrence matrix as a parameter and combining both dynamic and static similarity to determine joinability, our method facilitates a more effective discovery of joinable tables. The experimental results demonstrate that when dealing with complex ERP tables, our method improves the accuracy of identifying joinable tables by almost 7% compared to traditional frameworks, particularly in small and medium-sized enterprises in which table structures are complex and design practices are inconsistent.
However, this study has certain limitations. The proposed method relies heavily on the availability of a sufficient number of time slices. While all databases generate log files, these logs are often periodically purged due to their large size in practical applications, which may result in an insufficient number of time slices, thereby affecting the accuracy of the results. Future research could explore using the time attributes in static data in conjunction with contextual information from other attributes to mine time slices with business modularity, further improving the efficiency and convenience of joinable table discovery.
Overall, this research provides a new idea and method for the discovery of joinable tables. By combining business modularity and similarity to define joinability, the method significantly improves the accuracy of the results and reduces the time spent finding joinable tables, which is significant for improving the efficiency of data cleaning.

Author Contributions

Conceptualization, W.W. and C.Z.; methodology, W.W.; software, W.W. and H.Y.; validation, W.W., C.Z. and H.Y.; formal analysis, W.W.; investigation, C.Z.; resources, W.W.; data curation, H.Y.; writing—original draft preparation, W.W.; writing—review and editing, C.Z.; visualization, H.Y.; supervision, C.Z.; project administration, W.W.; funding acquisition, W.W. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Zhejiang Provincial Department of Education Research Project (No. Y202351243).

Data Availability Statement

The source codes will be made publicly accessible via our website: https://github.com/wangwwwork/dtl (accessed on 1 October 2024). The data presented in this study are available upon request from the corresponding author due to containing real production and financial data information.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Chepurko, N.; Marcus, R.; Zgraggen, E.; Fernandez, R.C.; Kraska, T.; Karger, D. ARDA: Automatic Relational Data Augmentation for Machine Learning. Proc. VLDB Endow. 2020, 13, 1373–1387.
  2. Dong, Y.; Oyamada, M. Table Enrichment System for Machine Learning. In Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval, Madrid, Spain, 6 July 2022; pp. 3267–3271.
  3. Zhao, Z.; Jian, Z.; Gaba, G.S.; Alroobaea, R.; Masud, M.; Rubaiee, S. An Improved Association Rule Mining Algorithm for Large Data. J. Intell. Syst. 2021, 30, 750–762.
  4. Jang, H.-J.; Yang, Y.; Park, J.S.; Kim, B. FP-Growth Algorithm for Discovering Region-Based Association Rule in the IoT Environment. Electronics 2021, 10, 3091.
  5. Gomes Dos Reis, D.; Ladeira, M.; Holanda, M.; De Carvalho Victorino, M. Large Database Schema Matching Using Data Mining Techniques. In Proceedings of the 2018 IEEE International Conference on Data Mining Workshops (ICDMW), Singapore, 17–20 November 2018; pp. 523–530.
  6. Pan, Z.; Pan, G.; Monti, A. Semantic-Similarity-Based Schema Matching for Management of Building Energy Data. Energies 2022, 15, 8894.
  7. Hättasch, B.; Truong-Ngoc, M.; Schmidt, A.; Binnig, C. It's AI Match: A Two-Step Approach for Schema Matching Using Embeddings. In Proceedings of the 2nd International Workshop on Applied AI for Database Systems and Applications, Tokyo, Japan, 31 August 2020.
  8. Wang, J.; Lin, C.; Zaniolo, C. MF-Join: Efficient Fuzzy String Similarity Join with Multi-Level Filtering. In Proceedings of the 2019 IEEE 35th International Conference on Data Engineering (ICDE), Macao, China, 8–11 April 2019; pp. 386–397.
  9. Choi, D.; Wee, J.; Song, S.; Lee, H.; Lim, J.; Bok, K.; Yoo, J. K-NN Query Optimization for High-Dimensional Index Using Machine Learning. Electronics 2023, 12, 2375.
  10. Zhang, H.; Zhang, Q. MinSearch: An Efficient Algorithm for Similarity Search under Edit Distance. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, Virtual Event, 23 August 2020; pp. 566–576.
  11. Zhu, E.; Deng, D.; Nargesian, F.; Miller, R.J. JOSIE: Overlap Set Similarity Search for Finding Joinable Tables in Data Lakes. In Proceedings of the 2019 International Conference on Management of Data, Amsterdam, The Netherlands, 25 June 2019; pp. 847–864.
  12. Zhang, M.; Hadjieleftheriou, M.; Ooi, B.C.; Procopiuc, C.M.; Srivastava, D. Automatic Discovery of Attributes in Relational Databases. In Proceedings of the 2011 ACM SIGMOD International Conference on Management of Data, Athens, Greece, 12 June 2011; pp. 109–120.
  13. Esmailoghli, M.; Quiané-Ruiz, J.-A.; Abedjan, Z. MATE: Multi-Attribute Table Extraction. Proc. VLDB Endow. 2022, 15, 1684–1696.
  14. Dong, Y.; Takeoka, K.; Xiao, C.; Oyamada, M. Efficient Joinable Table Discovery in Data Lakes: A High-Dimensional Similarity-Based Approach. In Proceedings of the 2021 IEEE 37th International Conference on Data Engineering (ICDE), Chania, Greece, 19–22 April 2021; pp. 456–467.
  15. Fan, G.; Wang, J.; Li, Y.; Zhang, D.; Miller, R.J. Semantics-Aware Dataset Discovery from Data Lakes with Contextualized Column-Based Representation Learning. Proc. VLDB Endow. 2023, 16, 1726–1739.
  16. Khatiwada, A.; Fan, G.; Shraga, R.; Chen, Z.; Gatterbauer, W.; Miller, R.J.; Riedewald, M. SANTOS: Relationship-Based Semantic Table Union Search. Proc. ACM Manag. Data 2023, 1, 1–25.
  17. Taha, I.; Lissandrini, M.; Simitsis, A.; Ioannidis, Y. A Study on Efficient Indexing for Table Search in Data Lakes. In Proceedings of the 2024 IEEE 18th International Conference on Semantic Computing (ICSC), Laguna Hills, CA, USA, 5 February 2024; pp. 245–252.
  18. Broder, A.Z. On the Resemblance and Containment of Documents. In Proceedings of the Compression and Complexity of SEQUENCES 1997, Salerno, Italy, 11–13 June 1997; pp. 21–29.
  19. Indyk, P.; Motwani, R. Approximate Nearest Neighbors: Towards Removing the Curse of Dimensionality. In Proceedings of the Thirtieth Annual ACM Symposium on Theory of Computing (STOC '98), Dallas, TX, USA, 23–26 May 1998; pp. 604–613.
  20. Ukey, N.; Yang, Z.; Li, B.; Zhang, G.; Hu, Y.; Zhang, W. Survey on Exact kNN Queries over High-Dimensional Data Space. Sensors 2023, 23, 629.
  21. Lv, Q.; Josephson, W.; Wang, Z.; Charikar, M.; Li, K. Multi-Probe LSH: Efficient Indexing for High-Dimensional Similarity Search. In Proceedings of the 33rd International Conference on Very Large Data Bases, Vienna, Austria, 23 September 2007; pp. 950–961.
  22. Korzeniowski, L.; Goczyla, K. Landscape of Automated Log Analysis: A Systematic Literature Review and Mapping Study. IEEE Access 2022, 10, 21892–21913.
  23. Ma, J.; Liu, Y.; Wan, H.; Sun, G. Automatic Parsing and Utilization of System Log Features in Log Analysis: A Survey. Appl. Sci. 2023, 13, 4930.
  24. Zhang, T.; Qiu, H.; Castellano, G.; Rifai, M.; Chen, C.S.; Pianese, F. System Log Parsing: A Survey. IEEE Trans. Knowl. Data Eng. 2023, 35, 8596–8614.
  25. Naseer, H.; Desouza, K.; Maynard, S.B.; Ahmad, A. Enabling Cybersecurity Incident Response Agility through Dynamic Capabilities: The Role of Real-Time Analytics. Eur. J. Inf. Syst. 2024, 33, 200–220.
  26. Imani, F.M.; Widyasari, Y.D.L.; Arifin, S.P. Optimizing Extract, Transform, and Load Process Using Change Data Capture. In Proceedings of the 2023 6th International Seminar on Research of Information Technology and Intelligent Systems (ISRITI), Batam, Indonesia, 11 December 2023; pp. 266–269.
  27. Qaiser, A.; Farooq, M.U.; Nabeel Mustafa, S.M.; Abrar, N. Comparative Analysis of ETL Tools in Big Data Analytics. Pak. J. Eng. Technol. 2023, 6, 7–12.
  28. Khan, B.; Jan, S.; Khan, W.; Chughtai, M.I. An Overview of ETL Techniques, Tools, Processes and Evaluations in Data Warehousing. J. Big Data 2024, 6, 1–20.
  29. Chy, M.S.H.; Arju, M.A.R.; Tella, S.M.; Cerny, T. Comparative Evaluation of Java Virtual Machine-Based Message Queue Services: A Study on Kafka, Artemis, Pulsar, and RocketMQ. Electronics 2023, 12, 4792.
  30. Zhu, E.; Nargesian, F.; Pu, K.Q.; Miller, R.J. LSH Ensemble: Internet-Scale Domain Search. Proc. VLDB Endow. 2016, 9, 1185–1196.
  31. Castro Fernandez, R.; Abedjan, Z.; Koko, F.; Yuan, G.; Madden, S.; Stonebraker, M. Aurum: A Data Discovery System. In Proceedings of the 2018 IEEE 34th International Conference on Data Engineering (ICDE), Paris, France, 16–19 April 2018; pp. 1001–1012.
Figure 1. Framework for computing joinability.
Figure 2. Dynamic data generation architecture.
Figure 3. Joinability calculations and lookups.
Figure 4. Numbers of columns of each type in the ERP database.
Figure 5. Average precision of the compared algorithms on the ERP database.
Figure 6. Average recall of the compared algorithms on the ERP database.
Figure 7. Impact of the time threshold on results obtained with the proposed method.
Figure 8. Impact of $I_{\max}$ on results.
Figure 9. Statistics on the proportion of time slice lengths.
Figure 10. Impact of time slice quantity on the results.
Table 1. Notation used in this study.

$A$, $B$: Data tables in the database.
$A_i$, $B_j$: The collection of data in column $i$ of table $A$; the collection of data in column $j$ of table $B$.
$a_{ij}$: The data value in row $i$ and column $j$ of table $A$.
$d(\cdot)$: A distance function used to determine similarity.
$\rho$: The similarity value.
$Q$: A query column.
$R$: A target column in the repository.
$\tau$: The similarity threshold.
$l$: The relevance (joinability) calculation function.
$h_{\min}$: The minimum hash function.
$h$: The hash function.
$R_t$: A collection of logs in time period $t$.
$Field$, $Field_{rf}$: A collection of field names in the log within time period $t$; attribute $f$ in table $r$.
$Data(Field)$: A data collection of all fields.
$G(x)$: The hash function group.
$s_{ij}$: The dynamic similarity between columns $i$ and $j$.
$C_{AB}$, $\sigma$: The number of co-occurrences of tables $A$ and $B$ within the time slices; the probability of co-occurrence.
Table 2. Commonly used joined field information.

Primary Key, INT (default, 11 bits): A self-incrementing approach is typically used, usually starting at 0. This approach results in higher similarity between columns.
Primary Key, BIGINT (20 bits and above): A snowflake algorithm generates a unique value; there will be no duplicates.
Business Code, BIGINT (20 bits and above): Custom codes are unique in businesses, such as prefix + date + serial number.
Business Code, VARCHAR (10 bits and above).
Business Code, CHAR (10 bits and above).
