The table-recognition process we propose is divided into five steps based on DCC. As
Figure 2 shows, it preprocesses the image, searches for DCCs in the intelligent IoT vision device-captured image, detects the table frame using five DCC-based techniques, restores the table by inverse perspective transformation so that the ruling lines become parallel, and extracts the table cells and contents for sensitive data analysis.
3.1. Directional Connected Chain (DCC)
DCC is a novel image structure element made up of run lengths. The definitions of a run length and a Directional Connected Chain (DCC) are explained in this section. Run lengths include horizontal run lengths and vertical run lengths. A vertical run length is a continuous segment of black pixels in the vertical direction whose width is only one pixel. Correspondingly, a horizontal run length is a continuous segment of black pixels in the horizontal direction whose height is only one pixel.
Figure 3 is an image consisting of pixels; it contains a vertical run length, a horizontal run length, and a run length that can represent both a horizontal run length and a vertical run length.
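To make the definition concrete, the sketch below extracts vertical run lengths column by column from a binarized image. It is a minimal illustration, assuming a NumPy array in which nonzero pixels are black (foreground); the function name and the (column, row_start, row_end) tuple representation are our own choices rather than notation from the paper.

```python
import numpy as np

def vertical_run_lengths(img):
    """Collect vertical run lengths: maximal vertical segments of
    foreground pixels that are one pixel wide.

    img: 2-D NumPy array, nonzero = black (foreground) pixel.
    Returns a list of (col, row_start, row_end) tuples (end inclusive).
    """
    runs = []
    h, w = img.shape
    for col in range(w):
        row = 0
        while row < h:
            if img[row, col]:
                start = row
                while row < h and img[row, col]:
                    row += 1
                runs.append((col, start, row - 1))
            else:
                row += 1
    return runs

# Horizontal run lengths can be obtained the same way on the transposed image:
# horizontal_runs = vertical_run_lengths(img.T)
```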
A DCC is a series of adjacent run lengths. It can be a horizontal DCC or a vertical DCC. A horizontal DCC is made up of a series of adjacent vertical run lengths, and it is used to detect horizontal lines and diagonal lines whose angles are less than 45°. Correspondingly, a vertical DCC is made up of a series of adjacent horizontal run lengths, and it is used to detect vertical lines and diagonal lines whose angles are larger than 45°.
DCC can go in any direction by connecting run lengths. Therefore, DCC can be used to represent a diagonal line. As
Figure 4 shows, the vertical run lengths through which the red line passes make up a horizontal DCC.
Perspective transformation turns table ruling lines into diagonal lines with different angles. In addition, most stamps and signatures have irregular shapes, which makes them easy to distinguish from the long straight lines in tables. Therefore, DCC is used in this paper to achieve table recognition.
DCC can represent the straight line of a table, so the DCC search is a key step in table recognition. As
Figure 4 shows, DCC search can be divided into two steps: determining the direction of a DCC, and predicting the run lengths that should be on the DCC.
The first step is to determine the direction of the DCC. A DCC is made up of adjacent run lengths, but adjacent run lengths can be either singly connected or non-singly connected. A run length is singly connected if it has exactly one adjacent run length on each side, or if one side has no run length and the other side has only one connected run length; otherwise, it is non-singly connected. A DCC will create branches and go in several directions if it contains a non-singly connected run length. In addition, the first several run lengths determine the direction of a DCC. Therefore, when searching for a DCC, a fixed number of adjacent singly connected run lengths should be found first. If fewer than this number of adjacent singly connected run lengths are found, the run lengths found so far make up a DCC on their own, the DCC search ends, and neither the rest of the first step nor the second step needs to be carried out. If enough adjacent singly connected run lengths are found, they become part of a new DCC, and a line l is obtained by least-squares fitting of the mid-points of these adjacent singly connected run lengths [31]. The fitting line can be represented as Equation (1), i.e., y = kx + b, where k represents the slope and b represents the intercept. As Figure 4 shows, the marked run lengths are the singly connected run lengths found in the first step, and l is the fitting line.
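As an illustration of this first step, the mid-points of the initial singly connected run lengths can be fitted with ordinary least squares, for example using NumPy's polyfit; the run-length tuple format follows the sketch above and is an assumption.

```python
import numpy as np

def fit_line(runs):
    """Least-squares fit y = k*x + b to the mid-points of vertical run lengths.

    runs: list of (col, row_start, row_end) tuples; the mid-point of a
    vertical run length is (col, (row_start + row_end) / 2).
    Returns (k, b).
    """
    xs = np.array([c for c, r0, r1 in runs], dtype=float)
    ys = np.array([(r0 + r1) / 2.0 for c, r0, r1 in runs], dtype=float)
    k, b = np.polyfit(xs, ys, 1)
    return k, b
```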
The second step is to predict the run lengths that appear after those found in the first step and to determine whether they should be included in the DCC. The fitting line l obtained from the first step is used to predict the y-coordinate of the run length at the next x-coordinate. If there is indeed a run length at the predicted position, the run length is appended to the DCC; in other words, the run lengths on the extension of l are added to the DCC. As Figure 4 shows, the run lengths on the dashed line are appended to the DCC in this prediction step, while the run lengths that are not on the extension of l are discarded. Although the run lengths found in the first step are singly connected, the run lengths appended during the prediction step need not be singly connected. The chain obtained by the above algorithm is called a DCC.
The following should be noted: (1) a run length whose length is longer than a threshold, which we call an abnormal run length, should be skipped in DCC fitting (the threshold is the largest width of the lines in a photograph); (2) once a run length is appended in the prediction process, the line l should be refitted; (3) in the prediction process, a miss of up to two run lengths is allowed in order to avoid line breaks caused by missing pixels. As Figure 5a shows, different colors represent different DCCs, which are generated from Figure 1.
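The prediction step, together with notes (2) and (3) above, might be implemented along the following lines. This is a sketch only: it reuses the fit_line helper from the earlier illustration, grows a horizontal DCC to the right, and the one-pixel matching tolerance and the helper names are illustrative assumptions rather than values from the paper.

```python
def extend_dcc(dcc, runs_by_col, max_misses=2, tol=1.0):
    """Grow a horizontal DCC to the right along its fitted line.

    dcc:          list of (col, row_start, row_end) run lengths already on the chain
    runs_by_col:  dict mapping a column index to its vertical run lengths
    max_misses:   number of consecutive missing run lengths tolerated
    tol:          allowed distance (pixels) between prediction and run mid-point
    """
    k, b = fit_line(dcc)                    # least-squares line from the first step
    col = max(c for c, _, _ in dcc) + 1
    misses = 0
    while misses <= max_misses:
        y_pred = k * col + b                # predicted y at the next x-coordinate
        hit = None
        for c, r0, r1 in runs_by_col.get(col, []):
            if abs((r0 + r1) / 2.0 - y_pred) <= tol:
                hit = (c, r0, r1)
                break
        if hit is not None:
            dcc.append(hit)
            k, b = fit_line(dcc)            # note (2): refit after each append
            misses = 0
        else:
            misses += 1                     # note (3): up to two misses allowed
        col += 1
    return dcc
```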
3.2. Table-Frame Detection
A DCC represents a line, and thus a ruling line of the table should be a single DCC. However, the DCCs found in
Section 3.1 may be broken by missing pixels. In addition, other interference such as stamps and characters, which do not belong to the table frame, is also found as a series of DCCs. To detect the table, the valid broken DCCs should be merged and the invalid interference DCCs should be removed. With the following five techniques, the table can be detected successfully through the elimination of invalid DCCs and the detection of table ruling lines. These five techniques are: abnormal DCC removal, broken DCC mergence, short DCC removal, incomplete DCC extension, and overlapped DCC mergence. In addition, during the detection process, stamp interference and irregular handwritten signatures can be filtered out while removing abnormal DCCs and short DCCs.
Since short DCC removal and incomplete DCC extension are relatively simple compared with the other techniques, a brief introduction to them is given here first. Text, light noise, etc. exist in the intelligent IoT vision device-captured image, and the lines comprising them are also recognized as DCCs. To recognize a table, these interferences should be removed. By observation, they are short, so the lines comprising them are usually found as short DCCs. We can remove them by filtering out the DCCs whose length is shorter than a threshold, as shown in Equation (2), in which the threshold is derived from the lengths of all n DCCs and scaled by a coefficient. As
Figure 5c shows, different colors mean different DCCs, and there are many short DCCs that compose characters and light noise. It can be seen clearly that those interference DCCs are removed with this technique, as shown in
Figure 5d.
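A short-DCC filter in the spirit of Equation (2) is sketched below. Since the exact form of the equation is not reproduced here, the sketch assumes the threshold is a coefficient times the average DCC length, and the coefficient value is purely illustrative.

```python
def remove_short_dccs(dccs, alpha=0.5):
    """Drop DCCs shorter than a threshold derived from all DCC lengths.

    dccs:  list of DCCs, each a list of run lengths (see earlier sketches)
    alpha: coefficient scaling the average length (illustrative value)
    """
    lengths = [len(d) for d in dccs]                 # length in run lengths
    threshold = alpha * sum(lengths) / len(lengths)  # assumed form of Eq. (2)
    return [d for d in dccs if len(d) >= threshold]
```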
A non-singly connected run length cannot be the start of a DCC, and some separate run lengths are not appended to a DCC, which means that the searched-for DCC cannot represent a complete ruling-line of a table. As shown in
Figure 5d, there are several incomplete ruling lines in the middle region, such as the blue one, the yellow one, the red one and so on. These DCCs should be extended to both sides by the prediction method introduced in
Section 3.1.
Figure 5e shows the extension result of
Figure 5d. Those incomplete DCCs are completed successfully by this technique in
Figure 5e, such as the blue one, the yellow one, the red one and so on.
The following are the details of abnormal DCC removal, broken DCC mergence, and overlapped DCC mergence. In addition, the techniques used to handle stamp interference are also introduced.
3.2.1. Abnormal DCC Removal
Statistical analysis of 603 intelligent IoT vision device-captured tables photographed by ourselves shows that, in every photograph, the slopes of the DCCs concentrate around a certain slope and approximately follow a Gaussian distribution, whether they are horizontal DCCs or vertical DCCs. The reason is that although the table ruling lines become skew lines under perspective transformation, the skew angle is not large. Based on this feature, abnormal DCCs are removed by building a frequency statistic of the slopes and retaining only the DCCs whose slopes are close to the most frequent one.
Figure 6 shows the slope distribution of horizontal DCCs in
Figure 5b. It can be seen clearly that those abnormal DCCs, such as the dark green DCC and grass green DCC at the upper left part of
Figure 5a, are removed with this technique, as shown in
Figure 5b.
In summary, this section demonstrates that the DCCs belonging to the table should have similar slopes, and shows how to find this slope in order to remove the abnormal DCCs.
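One way to realize this slope statistic is to histogram the DCC slopes, locate the most frequent bin, and keep only the DCCs whose slopes lie close to its centre. The bin count and the allowed deviation below are illustrative choices, not parameters from the paper.

```python
import numpy as np

def remove_abnormal_dccs(dccs, slopes, n_bins=50, max_dev=0.05):
    """Keep only the DCCs whose slope is close to the most frequent slope.

    dccs:   list of DCCs
    slopes: slope of each DCC's fitted line, aligned with dccs
    """
    slopes = np.asarray(slopes, dtype=float)
    hist, edges = np.histogram(slopes, bins=n_bins)
    peak_bin = int(np.argmax(hist))
    peak = 0.5 * (edges[peak_bin] + edges[peak_bin + 1])   # centre of the modal bin
    return [d for d, s in zip(dccs, slopes) if abs(s - peak) <= max_dev]
```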
3.2.2. Broken DCC Mergence
A ruling line in a table should be recognized as a single DCC, but sometimes it is recognized as several broken DCCs because of missing pixels. As shown in Figure 5b, the four borders of the table are broken into several DCCs, which should be merged to recognize a complete table frame. Given two DCCs C1 and C2, their fitting lines are l1 and l2, respectively, which can be represented as Equation (3), i.e., l1: y = k1x + b1 and l2: y = k2x + b2, where k1 and k2 represent the slopes of l1 and l2, respectively, and b1 and b2 represent the intercepts of l1 and l2, respectively.
The interval distance and the fitting deviation are first defined here. As Figure 7 shows, the interval distance is the distance between C1 and C2 in the horizontal direction. It can be computed by Equation (4) from the x-coordinate at which one DCC ends and the x-coordinate at which the other DCC starts, i.e., it measures the horizontal gap between the two DCCs. The fitting deviation is the deviation of the mutual prediction between l1 and l2. It can be computed by Equation (5), where l1(x) denotes the y-coordinate predicted by l1 at x-coordinate x, and the deviation is evaluated over the x-range between the starting point and the end point of the other DCC.
As a result, if C1 and C2 satisfy Equation (6), they will be merged. Equation (6) compares the slope difference, the intercept difference, and the interval distance between C1 and C2 against their respective thresholds. The interval-distance threshold depends on the lengths of C1 and C2 and is computed by Equation (7). Figure 5c shows the broken DCC mergence result of Figure 5b. It is clear that the broken DCCs, such as the four broken boundaries of the table in Figure 5b, are merged successfully. In summary, this section demonstrates that two broken DCCs should be merged when they satisfy Equation (6), in order to recover the table frame.
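A possible form of the merging test is sketched below. It assumes each DCC is summarized by its fitted slope k, intercept b, and the x-range it spans; the threshold values and the assumption that the interval-distance threshold of Equation (7) scales with the shorter DCC's length are illustrative, since the exact equations are not reproduced here.

```python
def should_merge_broken(d1, d2, k_thr=0.05, b_thr=10.0, dist_coef=0.5):
    """Decide whether two broken DCCs belong to the same ruling line.

    d1, d2: dicts with keys 'k', 'b' (fitted line y = k*x + b) and
            'x0', 'x1' (first and last x-coordinate of the DCC).
    """
    left, right = (d1, d2) if d1['x1'] <= d2['x0'] else (d2, d1)
    interval = right['x0'] - left['x1']                 # interval distance, Eq. (4)
    len1 = d1['x1'] - d1['x0']
    len2 = d2['x1'] - d2['x0']
    dist_thr = dist_coef * min(len1, len2)              # assumed form of Eq. (7)
    return (abs(d1['k'] - d2['k']) <= k_thr             # slope difference
            and abs(d1['b'] - d2['b']) <= b_thr         # intercept difference
            and interval <= dist_thr)                   # interval distance
```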
3.2.3. Overlapped DCC Mergence
Overlapped DCCs may appear after broken DCC mergence and DCC extension, such as the DCCs in the middle region of Figure 8. They may belong to the same ruling line but have become overlapped because of errors in slope computation, and thus they should be merged. Before that, the overlapped distance and the overlapped deviation are defined. Given that C represents a DCC and d((x, y), C) represents the distance from a point (x, y) to C, the overlapped distance is the larger of the two point-to-DCC distances measured between the overlapping parts of the two DCCs, and the overlapped deviation between C1 and C2 is given by Equation (8). C1 and C2 are merged if they satisfy Equation (9), which compares their slope difference, their intercept difference, their overlapped distance, and their overlapped deviation against the corresponding thresholds.
Figure 5f shows the result of
Figure 5e by this technique, and the overlapped DCCs in the middle region in
Figure 5e are merged successfully.
In summary, this section demonstrates that two DCCs may overlap, which makes one ruling line of the table appear as several lines. Therefore, if two DCCs satisfy Equation (9), they should be merged.
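A simplified version of the overlapped-DCC test could look like the following. The overlapped distance and overlapped deviation are approximated here from the vertical gap between the two fitted lines over the overlapping x-range, which is a simplification of Equations (8) and (9), and all threshold values are illustrative.

```python
def should_merge_overlapped(d1, d2, k_thr=0.05, b_thr=10.0,
                            ov_thr=3.0, dev_thr=2.0):
    """Decide whether two overlapped DCCs should be merged.

    d1, d2: dicts with 'k', 'b', 'x0', 'x1' as in the previous sketch.
    """
    x_lo = max(d1['x0'], d2['x0'])
    x_hi = min(d1['x1'], d2['x1'])
    if x_lo > x_hi:
        return False                                     # no overlap at all

    def gap(x):
        # vertical distance between the two fitted lines at x
        return abs((d1['k'] * x + d1['b']) - (d2['k'] * x + d2['b']))

    ov_dist = max(gap(x_lo), gap(x_hi))                  # larger of the two distances
    ov_dev = 0.5 * (gap(x_lo) + gap(x_hi))               # crude average deviation
    return (abs(d1['k'] - d2['k']) <= k_thr
            and abs(d1['b'] - d2['b']) <= b_thr
            and ov_dist <= ov_thr
            and ov_dev <= dev_thr)
```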
3.2.4. Stamp and Handwritten Signature Interference Processing
By observing and analyzing many tables with stamps, it is found that a stamp is searched for as a series of short DCCs with various slopes that form a circle, as shown in
Figure 5a. Therefore, the stamp can be filtered out by abnormal DCC removal and short DCC removal, as mentioned above.
First, abnormal DCC removal is considered. There are great differences between the slopes of the DCCs in a stamp and those of a table. In addition, table ruling lines are generally long DCCs, as shown in
Figure 5a. Based on these features, when calculating the frequency of DCC slopes, the length of each DCC is also used as a weight for its slope, so that the slopes belonging to a stamp can be filtered out. The weight of a DCC slope is computed by Equation (10), in which a length threshold appears; a DCC whose length is smaller than this threshold has little effect on the slope frequency statistic. There is obvious stamp interference in
Figure 5a, and it is clear that part of the stamp is filtered out in
Figure 5b by abnormal DCC removal.
Figure 5d shows the result of
Figure 5c after short DCC removal, and clearly the stamp is filtered out completely.
In addition, a handwritten signature belongs to the category of character interference, so it can be removed like other characters during recognition using the above five techniques. As
Figure 5f shows, the handwritten signature is removed completely.
This section demonstrates that interference such as circular stamps and irregular handwritten signatures is searched for as a series of short DCCs with various slopes. Therefore, such interference can be filtered out by abnormal DCC removal and short DCC removal.
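The length-weighted slope statistic used against stamps can be sketched as follows, under one plausible reading of Equation (10): a DCC shorter than the length threshold contributes only a small constant weight, while a longer DCC contributes its length. The constants are illustrative.

```python
import numpy as np

def slope_weight(length, min_len=30, small=1.0):
    """Weight of a DCC's slope in the frequency statistic (assumed Eq. (10))."""
    return float(length) if length >= min_len else small

def weighted_slope_peak(slopes, lengths, n_bins=50):
    """Most frequent slope, with each slope weighted by its DCC's length."""
    weights = [slope_weight(l) for l in lengths]
    hist, edges = np.histogram(np.asarray(slopes, float), bins=n_bins, weights=weights)
    i = int(np.argmax(hist))
    return 0.5 * (edges[i] + edges[i + 1])
```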
3.3. Table Restoration and Table-Cell Extraction
To analyze the data in every table cell, we should restore the table by inverse perspective transformation, and extract the table cells.
3.3.1. Table Restoration by Inverse Perspective Transformation
The perspective transformation table can be restored easily by inverse perspective transformation using Equations (
11) and (
12).
In these equations, (x, y) is a coordinate in the perspective-transformed image, (u, v) is the corresponding coordinate in the restored image, and A is the perspective transformation matrix.
From the above equations, it can be seen that the critical problem of inverse perspective transformation is computing the perspective transformation matrix, which requires four points in the perspective-transformed table and the four corresponding points in the restored table. The four points selected are the vertexes of the outermost frame of the table, because it holds most of the table information. To get the matrix, we first determine the outermost frame in the perspective-transformed table, then compute the coordinates of its four vertexes from the intersections of its edges, and finally compute the corresponding coordinates of those four vertexes in the restored table.
Determining the table's outermost frame is important for computing the matrix. There are still some interference DCCs beyond the table's outermost frame. Therefore, a table outermost-frame determination algorithm is proposed to remove these interferences and determine the outermost frame. Ruling lines in a table are generally long, so the DCCs that are shorter than a certain threshold in both the horizontal direction and the vertical direction are filtered out first. Furthermore, the outermost frame is a quadrilateral, so its horizontal edges cover both vertical edges. Based on this feature, as
Figure 9 shows, the vertical DCCs can be used to determine the range of the horizontal DCCs of the table, and correspondingly the horizontal DCCs can be used to determine the range of the vertical DCCs. That is, the leftmost x-coordinate of the vertical DCCs is the left boundary of the table in the horizontal direction, and the rightmost x-coordinate of the vertical DCCs is the right boundary of the table in the horizontal direction. After that, the DCCs that lie outside these boundaries are filtered out. In addition, the leftmost vertical DCC, the rightmost vertical DCC, the topmost horizontal DCC, and the bottommost horizontal DCC are selected as the outermost frame of the table. Algorithm 1 shows this table outermost border detection algorithm.
Figure 5g shows the outermost frame of the table in
Figure 1 by this algorithm.
After determining the outermost frame of the table, the intersection points of its edges are computed as its four vertexes. In addition, given the length of the longer horizontal edge and the length of the longer vertical edge, we restore only the table instead of the whole image, so the corresponding vertexes in the restored image are the four corners of a rectangle whose width is the length of the longer horizontal edge and whose height is the length of the longer vertical edge.
These eight coordinates are put into Equations (
11) and (
12), then the perspective transformation matrix will be computed, and the table is restored by inverse perspective transformation.
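In practice, the matrix of Equations (11) and (12) and the warp itself can be computed with standard OpenCV routines once the four vertexes of the outermost frame are known. The sketch below assumes the vertexes are supplied in a fixed order (top-left, top-right, bottom-right, bottom-left) and sizes the restored table by its longer horizontal and vertical edges, as described above.

```python
import cv2
import numpy as np

def restore_table(image, corners):
    """Warp the perspective-distorted table back to a fronto-parallel rectangle.

    image:   the captured image (NumPy array).
    corners: four table vertexes in the order top-left, top-right,
             bottom-right, bottom-left, as (x, y) pairs.
    """
    src = np.float32(corners)
    # width/height of the restored table: the longer horizontal/vertical edge
    width = int(max(np.linalg.norm(src[1] - src[0]), np.linalg.norm(src[2] - src[3])))
    height = int(max(np.linalg.norm(src[3] - src[0]), np.linalg.norm(src[2] - src[1])))
    dst = np.float32([[0, 0], [width, 0], [width, height], [0, height]])
    A = cv2.getPerspectiveTransform(src, dst)     # perspective transformation matrix
    return cv2.warpPerspective(image, A, (width, height))
```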
In summary, this section demonstrates that a perspective-transformed table should be restored by inverse perspective transformation. To implement this, the perspective transformation matrix is computed by obtaining the four vertexes of the table's outermost frame. Algorithm 1 shows how to detect the table's outermost frame.
Algorithm 1 Table outermost border detection. Input: the horizontal DCCs and vertical DCCs detected in the image. Output: the four points of the table's outermost border. The algorithm scans the vertical DCCs to obtain the leftmost and rightmost x-coordinates, scans the horizontal DCCs to obtain the topmost and bottommost y-coordinates, deletes the DCCs lying outside these boundaries, and finally takes the leftmost vertical DCC, the rightmost vertical DCC, the topmost horizontal DCC, and the bottommost horizontal DCC as the edges of the outermost border, whose intersections give the four output points.
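A compact version of the procedure behind Algorithm 1 might look like the following. The DCC summary used here (the x- and y-extent of each chain) and the strict in-range filtering are assumptions based on the description above; a small tolerance would likely be added in practice.

```python
def outermost_frame(h_dccs, v_dccs):
    """Pick the four outermost ruling lines from horizontal and vertical DCCs.

    h_dccs, v_dccs: dicts with 'x0', 'x1', 'y0', 'y1' giving each DCC's extent.
    Returns (top, bottom, left, right) DCCs of the table's outermost frame.
    """
    # horizontal table range given by the vertical DCCs, and vice versa
    x_min = min(d['x0'] for d in v_dccs)
    x_max = max(d['x1'] for d in v_dccs)
    y_min = min(d['y0'] for d in h_dccs)
    y_max = max(d['y1'] for d in h_dccs)
    # drop DCCs lying outside the table's range
    h_dccs = [d for d in h_dccs if x_min <= d['x0'] and d['x1'] <= x_max]
    v_dccs = [d for d in v_dccs if y_min <= d['y0'] and d['y1'] <= y_max]
    top = min(h_dccs, key=lambda d: d['y0'])
    bottom = max(h_dccs, key=lambda d: d['y1'])
    left = min(v_dccs, key=lambda d: d['x0'])
    right = max(v_dccs, key=lambda d: d['x1'])
    return top, bottom, left, right
```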
3.3.2. Table-Cell Extraction
In the restored table, to analyze the data in every table cell, the table cells must be extracted and the table contents recognized.
For table-cell extraction, the key step is to detect the table frame, which was discussed in
Section 3.2. First, the DCCs are searched for in the restored table by the method introduced in
Section 3.1, and then the table frame is detected by the techniques described in
Section 3.2. After that, the vertexes of all the table cells are obtained by computing the intersection points of the horizontal ruling lines and the vertical ruling lines.
Figure 5i shows the table frame recognized in the end.
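The cell vertexes can be computed by intersecting every horizontal ruling line with every vertical ruling line. The sketch below assumes each ruling line is represented in the general form a·x + b·y = c and skips near-parallel pairs; this representation is a convenience choice, not the paper's notation.

```python
def line_intersections(h_lines, v_lines, eps=1e-9):
    """Intersect horizontal ruling lines with vertical ruling lines.

    Each line is given as (a, b, c) for a*x + b*y = c.
    Returns a list of (x, y) cell-vertex candidates.
    """
    points = []
    for a1, b1, c1 in h_lines:
        for a2, b2, c2 in v_lines:
            det = a1 * b2 - a2 * b1
            if abs(det) < eps:          # nearly parallel: no usable intersection
                continue
            x = (c1 * b2 - c2 * b1) / det
            y = (a1 * c2 - a2 * c1) / det
            points.append((x, y))
    return points
```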
For table contents, the recognition method is quite mature [
32,
33], and thus it will not be discussed in this paper.
By the above steps, the whole table can be recognized, and then the data can be analyzed to determine whether sensitive data exists.