In this section, we first introduce the concepts related to sub-trees, and then describe and analyze the proposed algorithm.
5.1. Important Concepts about Sub-Trees
The TNT-HUI algorithm is a recursive algorithm that iterates over sub-trees of the initially constructed global TN-Tree. To clarify the description of the algorithm, we first give the following definitions.
Definition 7 (Base-Itemset and Conditional Tree). A conditional tree (also called a sub-tree) [6] of itemset X is a tree that is constructed from all transaction itemsets containing itemset X (X is removed from these transaction itemsets before they are added to the conditional tree). Itemset X is called the base-itemset of this conditional tree. A tree that is constructed from all transaction itemsets of a dataset, and whose base-itemset is therefore null, is called a global tree; that is, a global tree is a conditional tree whose base-itemset is null. The utility $u(X, t)$ of the base-itemset X in a transaction itemset t containing X is also called the base-utility (abbreviated as bu) of transaction itemset t in the conditional tree T.
Definition 8 (Sub Dataset). In a conditional tree T whose base-itemset is X (if X is null, T is a global tree), suppose item Q appears in k tail nodes, and the corresponding path-itemsets are $P_1, P_2, \ldots, P_k$. The itemsets $P_1, P_2, \ldots, P_k$ (along with their utility values) constitute the sub dataset of itemset $X \cup \{Q\}$. Each record in a sub dataset is called a sub transaction-itemset.
Definition 9 (Local Candidacy). If the twu value of an item in a sub dataset is less than the minimum utility value, the item is called a local unpromising item; otherwise, it is called a local promising item.
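To make these definitions concrete, the following minimal sketch (in Python) shows the node and tail-information structures that the rest of this section assumes. The field abbreviations sn (support number), bu (base-utility), pu (path-utility), and ul (utility list) follow the paper's terminology; the class layout itself is illustrative, not the authors' implementation.

from dataclasses import dataclass, field

@dataclass
class TailInfo:
    sn: int = 0          # support number of the path-itemset
    bu: int = 0          # base-utility: utility of the base-itemset
    pu: int = 0          # path-utility: total utility of the path-itemset
    ul: dict = field(default_factory=dict)   # item -> utility on this path

@dataclass
class TNNode:
    item: str = None
    parent: "TNNode" = None
    children: dict = field(default_factory=dict)  # item -> child TNNode
    tail: TailInfo = None    # only tail-nodes carry tail-information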
According to Theorem 1, the algorithm TNT-HUI removes all unpromising items from the original transaction itemsets when it creates the TN-Tree. For the same reason, it removes all local unpromising items of a sub dataset when it creates a sub TN-Tree.
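As a rough illustration of this pruning (not the paper's code), the sketch below computes twu values and keeps only the promising items. The input format, a list of (item-to-utility map, transaction utility) pairs, is an assumption made for the example.

def promising_items(transactions, min_util):
    """Return the items whose twu is not less than min_util (Theorem 1)."""
    twu = {}
    for item_utils, tu in transactions:   # tu: total utility of the transaction
        for item in item_utils:
            twu[item] = twu.get(item, 0) + tu
    return {item for item, w in twu.items() if w >= min_util}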
5.2. Algorithm Description
The algorithm of mining HUIs from a TN-Tree is shown in Algorithm 2.
We process each item (denoted as Q) in the header table H, starting from the last item, by the following steps.
First, we find the total $bu$, $pu$, and $u$ values over all nodes of item Q. Since nodes are added to the tree according to the order of the header table, these nodes are all tail nodes, and each of them contains tail-information. We compute the sum of those nodes' base-utilities, denoted $bu_{sum}$ (Line 4), and the sum of those nodes' path-utilities, denoted $pu_{sum}$ (Line 5). For each tail-node $N_i$, we find item Q's utility in the utility list $ul_i$ and denote it as $ul_i(Q)$; we then sum these values up as $u_{sum}$ (Line 6).
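Under the structures sketched in Section 5.1, Step 1 can be summarized as follows; the function name and the way tail nodes are handed in (e.g., via the header table's node links) are illustrative assumptions.

def sum_tail_values(tail_nodes, q):
    """Step 1: accumulate the three sums over item Q's tail-nodes."""
    bu_sum = sum(n.tail.bu for n in tail_nodes)      # Line 4: base-utilities
    pu_sum = sum(n.tail.pu for n in tail_nodes)      # Line 5: path-utilities
    u_sum = sum(n.tail.ul[q] for n in tail_nodes)    # Line 6: Q's own utilities
    return bu_sum, pu_sum, u_sum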
Then, if $bu_{sum} + pu_{sum}$ is less than the predefined minimum utility value, we go to Step 3; otherwise, we add item Q to the base-itemset (which is initialized as an empty set), generate an HUI where applicable, and create a sub TN-Tree to perform mining recursively (Lines 13–14). More specifically, if $bu_{sum} + u_{sum}$ is not less than the predefined minimum utility value, then the current base-itemset is an HUI (Lines 10–12). We remove item Q from the current base-itemset after we perform the recursive mining process on the new sub TN-Tree (Line 16).
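A sketch of Step 2's decision logic follows. Here build_sub_tree stands in for Algorithm 3 (sketched after its description below), and the recursive mining call is left abstract; both are assumptions of this example rather than the paper's code.

def process_item(q, tail_nodes, base, min_util):
    """Steps 1-2 for item Q; Step 3 (the tail moves) follows separately."""
    bu_sum, pu_sum, u_sum = sum_tail_values(tail_nodes, q)
    if bu_sum + pu_sum < min_util:     # no superset of base + {Q} can be an HUI
        return                         # proceed directly to Step 3
    base.append(q)
    if bu_sum + u_sum >= min_util:     # exact utility of base + {Q} (Lines 10-12)
        print("HUI:", list(base), "utility:", bu_sum + u_sum)
    sub_root, sub_header = build_sub_tree(tail_nodes, q, min_util)  # Lines 13-14
    # ... recursively mine (sub_root, sub_header) with the extended base ...
    base.pop()                         # Line 16: remove Q from the base-itemset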
Finally, for each of these k tail-nodes (which we denote as $N_i$, $1 \le i \le k$), we modify its tail-information by deleting item Q's utility $ul_i(Q)$ from the utility list $ul_i$ and updating the path-utility as $pu_i = pu_i - ul_i(Q)$. If its parent node contains tail-information, then we accumulate this tail-information into its parent's tail-information (Lines 27–30); otherwise, we move this tail-information to its parent (Lines 22–25).
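Step 3's add/move operation can be sketched as below, again under the assumed structures; this is an illustration of the rule just described, not the authors' implementation.

def push_tail_to_parent(node, q):
    """Step 3: strip Q from a tail-node's tail-information and pass it upward."""
    info = node.tail
    info.pu -= info.ul.pop(q)        # delete Q's utility and shrink the path-utility
    parent = node.parent
    if parent.tail is not None:      # Lines 27-30: accumulate into the parent's tail
        parent.tail.sn += info.sn
        parent.tail.bu += info.bu
        parent.tail.pu += info.pu
        for item, u in info.ul.items():
            parent.tail.ul[item] = parent.tail.ul.get(item, 0) + u
    else:                            # Lines 22-25: move the tail-information upward
        parent.tail = info
    node.tail = None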
The sub-tree construction process of Algorithm 2, summarized in Algorithm 3, is as follows. First, we create a new header table subH by scanning the corresponding path-itemsets in the current TN-Tree (Lines 1–8), which includes deleting unpromising items from subH and sorting its items in the descending order of sn/twu (Lines 7–8). Second, we process each path-itemset in the current TN-Tree, which includes deleting unpromising items (Line 14), sorting the remaining items according to subH (Line 15), and inserting the path-itemset into a new TN-Tree subT (Lines 16–20).
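A condensed sketch of Algorithm 3 under the same assumptions is given below. It reuses promising_items, TNNode, and TailInfo from the earlier sketches; the alphabetical sorted(header) is only a stand-in for the paper's descending sn/twu order, and the helper insert_record is an illustrative name.

def build_sub_tree(tail_nodes, q, min_util):
    # Lines 1-8: collect Q's sub transaction-itemsets (Definition 8) and
    # build the sub-header table from the local promising items.
    records = []
    for n in tail_nodes:
        info, utils, node = n.tail, {}, n.parent
        while node.parent is not None:            # walk the path up to the root
            utils[node.item] = info.ul[node.item]
            node = node.parent
        records.append((utils, info))
    header = promising_items(
        [(utils, info.bu + info.pu) for utils, info in records], min_util)
    order = sorted(header)   # stand-in for the descending sn/twu order (Lines 7-8)
    # Lines 9-20: prune, reorder, and insert each sub transaction-itemset.
    sub_root = TNNode(item=None)
    for utils, info in records:
        kept = {i: u for i, u in utils.items() if i in header}    # Line 14
        path = [i for i in order if i in kept]                    # Line 15
        insert_record(sub_root, path, kept,
                      sn=info.sn, bu=info.bu + info.ul[q])        # Lines 16-20
    return sub_root, order

def insert_record(root, path, utils, sn, bu):
    """Insert one pruned record; its tail-information accumulates on the tail-node."""
    node = root
    for item in path:
        node = node.children.setdefault(item, TNNode(item, parent=node))
    if node.tail is None:
        node.tail = TailInfo()
    node.tail.sn += sn
    node.tail.bu += bu                   # Q's utility folds into the base-utility
    node.tail.pu += sum(utils.values())
    for item, u in utils.items():
        node.tail.ul[item] = node.tail.ul.get(item, 0) + u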
Algorithm 2: TNT-HUI
Example 3 (HUI Mining Based on TN-Tree). In Figure 1, item "B" is the last item in the header table. We first calculate its $bu_{sum}$, $pu_{sum}$, and $u_{sum}$ values. Because $bu_{sum} + pu_{sum}$ is not less than the minimum utility value, we add item "B" to the base-itemset (initialized as null), resulting in base-itemset {B}. Because $bu_{sum} + u_{sum}$ is less than the minimum utility value, itemset {B} is not a high utility itemset; however, we still construct a sub-header table and a sub TN-Tree for the current base-itemset {B}.

A sub-header table is created as follows. From the path "root-C-D-E-B" in Figure 1, we get the itemset {C, D, E, B}, the support number of each item (i.e., 1), and the utilities of items C, D, E, and B and of the whole itemset (i.e., 3, 6, 3, 12, and 24, respectively); see the first sub transaction-itemset of the sub dataset in Figure 3a. Similarly, we can get the other three sub transaction-itemsets from the other three paths, root-C-D-A-E-B, root-C-E-B, and root-C-B; see the sub dataset in Figure 3a (the number associated with each item, such as the 3 attached to item C in the first record, is the utility value of this item). The sub-header table is created by scanning the sub dataset in Figure 3a; the result is shown in Figure 3d. A sub-header table maintains only the local promising items. A sub TN-Tree is created by the TN-Tree construction method of Section 4.2, except that the utility of the base-itemset {B} in each sub transaction-itemset in Figure 3a is accumulated into the bu field on the tail-node, and item B itself is not added to the sub TN-Tree. The result is shown in Figure 3d.

Then, we perform a recursive mining process on the new sub-header table and sub TN-Tree. For the last item E in the header table in Figure 3d, we again calculate $bu_{sum}$, $pu_{sum}$, and $u_{sum}$. Because $bu_{sum} + pu_{sum}$ and $bu_{sum} + u_{sum}$ are both less than the minimum utility value, item E is not added to the base-itemset, and no new sub TN-Tree or HUI is generated. It is the same for item C of the header table in Figure 3d. After processing all items in Figure 3d, we continue with the remaining items of the header table in Figure 1.

Figure 3b is the sub dataset of itemset {A}, and Figure 3e is the corresponding sub TN-Tree of itemset {A}. For the last item D in the header table in Figure 3e, we calculate $bu_{sum}$, $pu_{sum}$, and $u_{sum}$. Based on the same tests as described above, item D is added to the current base-itemset, resulting in base-itemset {AD}; although {AD} is still not an HUI, it satisfies the condition for constructing a sub TN-Tree. From Figure 3e, we can get the sub dataset of itemset {AD} in Figure 3c, and the new sub-tree is shown in Figure 3f. Now, because $bu_{sum} + pu_{sum}$ is not less than the minimum utility value, we add item C to the base-itemset and get base-itemset {ADC}. Because $bu_{sum} + u_{sum}$ is not less than the minimum utility value, {ADC} is an HUI. Next, we continue processing the remaining items of the sub-header table in Figure 3e. Then, we continue processing the remaining items of the header table in Figure 1.

The "add/move" process (Step 3 of Algorithm 2) is a key operation of this algorithm. When a transaction itemset (or sub transaction-itemset) is added to a TN-Tree, its support number, base-utility, and each of its items' utilities are stored in its tail-node, not in the nodes themselves. Moreover, since an item can appear in multiple branches, its support number, base-utility, utility, etc., should be the sum of the corresponding values over all of its tail-nodes. Therefore, the tail-information of a node should be passed to its parent node after the node is processed. For example, after processing the first tail-node of B in Figure 1, according to Step 3, we remove B's utility (12) from the utility list and subtract it from the path-utility (24), resulting in a new tail-information. Since B's parent node E does not contain tail-information, we move this new tail-information to node E (see Figure 4). In the same manner, two further tail-nodes of B in Figure 1 are processed and their tail-information moved to their parent nodes, resulting in updated nodes E and C. The remaining tail-node of B is added into its parent node's tail-information (because that parent node already contains tail-information), resulting in an updated node E; see Figure 4.
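For instance, the first move described above (removing B's utility of 12 and reducing the path-utility 24 to 12) can be traced with the sketches from earlier in this section. The single path root-C-D-E-B, with item utilities 3, 6, 3, and 12, is rebuilt here by hand; this is a hypothetical reconstruction of the figure, for illustration only.

root = TNNode(item=None)
c = TNNode("C", parent=root)
d = TNNode("D", parent=c)
e = TNNode("E", parent=d)
b = TNNode("B", parent=e,
           tail=TailInfo(sn=1, bu=0, pu=24,
                         ul={"C": 3, "D": 6, "E": 3, "B": 12}))

push_tail_to_parent(b, "B")
print(e.tail)   # TailInfo(sn=1, bu=0, pu=12, ul={'C': 3, 'D': 6, 'E': 3})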
Algorithm 3: Create sub-header table and sub TN-Tree
5.3. Analysis of the Algorithm
Property 1 (TN-Tree Completeness). Given a transaction dataset $DB$ and a minimum utility value, its corresponding TN-Tree contains the complete information of $DB$ relevant to HUI mining.
Proof. Based on the TN-Tree construction process, all transaction itemsets that contain the same (local) promising items are mapped to one path (for example, two such transaction itemsets in Table 1 are mapped to one path in Figure 1) and share the same tail-node. The sum of the utilities of each item over those transactions is stored in the utility list ul on the tail-node. Thus, the utility of an itemset X in $DB$ can also be retrieved from the corresponding tail-nodes. □
Property 2. Let $DB$ be a dataset, $DB_X$ be the sub dataset of itemset X, and let itemset Y occur in both $DB$ and $DB_X$. Then, the utility of $X \cup Y$ in $DB$ is equivalent to the utility of $X \cup Y$ in $DB_X$, and itemset $X \cup Y$ is an HUI in $DB$ if and only if it is an HUI in $DB_X$.
Proof. Based on the sub dataset construction process in Example 3 and Definition 8, all transactions containing itemset X (and hence all transactions containing $X \cup Y$) are mapped into $DB_X$. Thus, the utility of itemset $X \cup Y$ in $DB$ is equivalent to the utility of $X \cup Y$ in $DB_X$. So, itemset $X \cup Y$ is an HUI in $DB$ if and only if its utility in $DB_X$ is not less than the minimum utility value. □
Property 3 (TNT-HUI Correctness). Given a base-itemset X, whose total base-utility in subDB is $bu_{sum}$, for any promising item Q in subDB: (1) if $bu_{sum} + pu_{sum}$ is less than the minimum utility value, then no superset of itemset $X \cup \{Q\}$ is an HUI; (2) if $bu_{sum} + u_{sum}$ is not less than the minimum utility value, then itemset $X \cup \{Q\}$ is an HUI; otherwise, it is not an HUI.
Proof. (1) Firstly, based on the (sub) header table construction process, the twu value of an item in a (sub) header table includes the utility values of the (local) unpromising items in the corresponding transactions. Secondly, after an item in a (sub) header table is processed, the algorithm TNT-HUI has already mined all HUIs containing this item, so the algorithm need not consider processed items when it processes the remaining items in a (sub) header table. For these two reasons, the twu value of an item in a (sub) header table must be re-calculated. In the algorithm TNT-HUI, the value $bu_{sum} + pu_{sum}$ is the new twu of itemset $X \cup \{Q\}$, and it does not include the utility values of the two kinds of items mentioned above (unpromising items and processed items). According to Theorem 1, no superset of itemset $X \cup \{Q\}$ is an HUI if $bu_{sum} + pu_{sum}$ is less than the minimum utility value.
(2) Let $DB_X$ be the sub dataset of itemset X (if X is null, $DB_X$ is the original dataset). Based on the sub TN-Tree construction process, the value $bu_{sum} + u_{sum}$ is the utility of itemset $X \cup \{Q\}$ in $DB_X$. According to Property 2, itemset $X \cup \{Q\}$ is a high utility itemset if and only if $bu_{sum} + u_{sum}$ is not less than the minimum utility value. □
Property 3 guarantees that all itemsets mined by the algorithm TNT-HUI are HUIs. For example, in Example 3, the utility value of each new base-itemset is obtained from the tree, so the base-itemset is reported as an HUI only if its utility value is not less than the minimum utility value. Note that, in the special case where X is null, the sub TN-Tree is the global TN-Tree.
Property 4 (TNT-HUI Completeness). The result found by TNT-HUI is complete. In other words, all HUIs can be discovered by TNT-HUI.
Proof. Assume $X = \{x_1, x_2, \ldots, x_n\}$ to be an HUI, with its items listed in processing order; by Theorem 1, every sub-itemset of X is then a promising itemset. The algorithm TNT-HUI creates a sub-header table and a sub-tree for $\{x_1\}$, $\{x_1, x_2\}$, ..., $\{x_1, x_2, \ldots, x_{n-1}\}$ in turn. Finally, it can obtain X from the sub-tree of the itemset $\{x_1, x_2, \ldots, x_{n-1}\}$. So, TNT-HUI can mine all HUIs. □
Property 5. When processing an item of a header table, the algorithm TNT-HUI may construct no new sub-tree and generate no HUI.
Proof. When the algorithm TNT-HUI is processing an item of a header table, the utility of the item ($u_{sum}$), the base-utility of the current base-itemset ($bu_{sum}$), and the sum of the corresponding path-utilities ($pu_{sum}$) are obtained from its tail-information (shown in Steps 1 and 2 in Figure 2). Because the sum of $bu_{sum}$ and $pu_{sum}$ does not include the utilities of (local) unpromising items or of processed items, it will be no greater than the item's twu value in the header table. If the sum of $bu_{sum}$ and $pu_{sum}$ is less than the minimum utility value, TNT-HUI will not construct a new sub-tree. If the sum of $bu_{sum}$ and $u_{sum}$ is less than the minimum utility value, no HUI is generated. □
For example, when processing the item E in the header table in Figure 3d, the values $u_{sum}$, $bu_{sum}$, and $pu_{sum}$ are obtained from the tail-information. Because both ($bu_{sum} + pu_{sum}$) and ($bu_{sum} + u_{sum}$) are less than the predefined minimum utility value 70, the algorithm TNT-HUI neither constructs a new sub-tree for the itemset {BE} nor generates {BE} as an HUI.
5.4. Comparison with Existing HUI Mining Algorithms
Tree structures have been used to represent transaction databases for pattern mining. For example, for the dataset in Table 1 and the profit table in Table 2, a global IHUP-Tree is shown in Figure 5, in which items are arranged in the descending order of their twu values. In the second step, IHUP generates candidate HUIs from the IHUP-Tree by employing the FP-Growth method [6]. In the third step, IHUP scans the dataset to identify all HUIs among the candidates. During the construction of a UP-Tree, the unpromising items and their utilities are eliminated from the transaction utilities, and the utilities of the descendants of a node are discarded from the utility of that node. For any itemset X, the value of TWU(X) in the UP-Tree is not bigger than that in the IHUP-Tree, so the number of candidates created by the algorithm UP-Growth is not bigger than that created by the algorithm IHUP.
The header tables of the algorithms IHUP and UP-Growth contain the item, its twu value, and link information, as shown in Figure 5. The structures of the IHUP-Tree and the UP-Tree are identical: each node contains an item, a support number, a twu value (or a value derived from the twu value), a link to its parent, links to its children, and a link to the next node carrying the same item.
When a transaction itemset is inserted into a UP-Tree, each node does not include the utility values of its children's nodes. Hence, UP-Growth's over-estimated utility value (used for judging whether an itemset is a candidate) is lower than that of IHUP; this effectively reduces the number of candidates and improves the efficiency of candidate checking. After mapping transaction itemsets to a TN-Tree, in contrast, the itemsets' exact utility values can be retrieved from the tree, so TNT-HUI mines HUIs without generating candidates at all.
Property 6. For the sub-trees and sub-header tables of the same base-itemset, the number of items (promising items) in the sub-header table and the number of nodes in the sub-tree created by the algorithm TNT-HUI are not more than those created by the algorithm UP-Growth [16] or the algorithm IHUP [11].

Proof. When a transaction itemset (or sub transaction-itemset) is added to a tree by the algorithm UP-Growth or IHUP, the utility value of each item is stored in the corresponding node and aggregated over all transactions passing through that node (as shown in Figure 5c,d), so the actual utility of each item on each individual path of the UP-Tree or IHUP-Tree cannot be obtained when a sub-tree is constructed. An over-estimated utility value of each item is therefore used to construct the sub-tree and the new sub-header table. On the other hand, TNT-HUI creates a sub-tree and a sub-header table from the actual utility of each item on each path of a TN-Tree. Since the over-estimated value is not less than the actual value, the number of (local) promising items in the sub-header table, and the number of nodes in a sub-tree, will not be larger than in the algorithms UP-Growth or IHUP. □