1. Introduction
With the rapid development of industrial science and technology, products are becoming smaller and more complex. High precision is therefore crucial for improving assembly quality and reducing the need for inspection. Long working hours can easily lead to worker fatigue, and production accuracy varies from worker to worker; consequently, product accuracy is difficult to maintain. In recent years, machine vision [1] has been widely used for auxiliary identification. Images are captured by a charge-coupled device (CCD) camera and then processed and identified automatically, which yields consistent identification results.
Industry 4.0 is oriented toward microelectromechanical and micro–nano systems, and relevant product development aims at lightness and portability. Therefore, semiconductor processing equipment and machine tools will develop in this direction. To meet the requirements for producing miniature workpieces, the machining accuracy demands of the machine tool industry have gradually increased. A precision positioning stage enables these accuracy demands for miniature machining to be satisfied.
Geometrical alignment in stacked XYθz stages usually causes cumulative errors, such as errors in parallelism, orthogonality between two axes, and flatness. In [2,3], coplanar XYθz stages were devised to avoid such errors; they exhibited superior positioning accuracy compared with stacked-type stages. In [4], alignment in wafer manufacturing was performed by using two CCD cameras with an XYθz stage and a feed-forward neural network controller. The positioning precision was approximately 12 μm. The experiment employed a cross mark for alignment. Cross marks are commonly used in visual alignment systems because they have clear features and can be easily identified. In [5], a fuzzy logic controller was used to control the motion of a stacked XY alignment stage, with the imaging software eVision used for image processing.
In [6], an XXY stage mask alignment system with dual CCD image servos was designed. On the basis of the motion method of an artificial neural network, nonlinear mapping of the stage at the required position was conducted, and the directions of the commands of the three motors were established. In [7], the XXY stage was integrated with dual CCDs to establish an automatic alignment and recognition system. Automated optical inspection (AOI) technology was proposed to guide the stage and to combine image technology with stage control technology for image recognition and alignment tasks. In [8,9], a specially designed coplanar stage for visual servo control and an image alignment system were proposed, and the alignment movement and error of the image on the coplanar XXY stage were analyzed. Subsequently, the influence of the kinematics analysis and setting errors was discussed. An image alignment method with a floating reference point was proposed to reduce the effect of the alignment error between the center of the workpiece and the reference point of the stage. In [10], a microcontroller was used for an image-based XXY positioning platform. The positioning error between the XXY stage and the detected object was determined by image recognition technology, and this error information was then used for positioning control. In [11], automatic locating and image servo alignment for the touch panels of a laminating machine were likewise performed by using four CCD cameras and a coplanar XXY stage.
A long short-term memory (LSTM) network [12] is a variant of a recurrent neural network (RNN) and was first proposed in 1997. Because their unique design addresses the vanishing gradient problem, LSTM networks are often used to handle time-series data. LSTM networks are complex nonlinear units that can be used to construct a large deep learning network. In recent years, many studies have used LSTM networks to address different problems related to time-series data.
In [13], a high-precision positioning method based on LSTM was proposed to ensure that the positioning and navigation systems of intelligent network vehicles can still output a location with high accuracy when global navigation satellite system (GNSS) positioning fails. Experimental results indicated that the method’s performance met the high-precision positioning and navigation requirements of intelligent network vehicles on urban roads. The results demonstrated that the LSTM network could approximate the relationship between the input and output of a GNSS-integrated navigation system (INS) with high precision. In [14], a visual recognition system was combined with a machine-learning neural network to predict the maximum pick-and-place offset of a robot arm in the next minute. The developed LSTM model made predictions with high accuracy and reliability and met the needs of handling robots. In [15], an LSTM model predicted the remaining life of gears, and the LSTM networks could capture both short-term correlation and long-term dependence. In [16], a system was proposed for accurately detecting the lane line (LL) on roads to improve vehicle driving safety. An LL prediction model based on LSTM was established according to the spatial information and distribution law of LLs, and the future LL location was predicted using historical LL location data. In [17], LSTM was used to improve the particle swarm optimization (PSO) algorithm for finding the best fitness value on the XXY stage with three types of motion. The developed LSTM predicted the fitness value of PSO, which eliminated the need to preassess the fitness value, and adjusted the inertia weight of PSO adaptively. The experimental results indicated that LSTM could reduce the time required to find optimal control parameters and that the positioning error of the XXY stage could be reduced significantly.
In the present study, the positioning information of an XXY micromotion stage was used to construct a time-series predictive model through machine learning. First, stage displacement data were collected through imaging, and the collected data were used to establish the training and verification subsets for the neural network. Subsequently, the LSTM model predicted the next movement. Finally, the stage was adjusted according to the results predicted by the model to achieve optimal positioning.
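The data preparation step described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the window length, the 80/20 split, the function names `make_windows` and `chronological_split`, and the sample readings are all assumptions made for the example.

```python
from typing import List, Tuple


def make_windows(series: List[float], window: int) -> Tuple[List[List[float]], List[float]]:
    """Turn a 1-D displacement series into (input window, next value) pairs."""
    if window < 1 or window >= len(series):
        raise ValueError("window must satisfy 1 <= window < len(series)")
    inputs, targets = [], []
    for start in range(len(series) - window):
        inputs.append(series[start:start + window])   # the last `window` readings
        targets.append(series[start + window])        # the reading to be predicted
    return inputs, targets


def chronological_split(inputs, targets, train_fraction=0.8):
    """Split without shuffling so that verification samples follow training samples in time."""
    cut = int(len(inputs) * train_fraction)
    return (inputs[:cut], targets[:cut]), (inputs[cut:], targets[cut:])


# Hypothetical displacement readings (arbitrary units), for illustration only.
displacements = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
X, y = make_windows(displacements, window=3)
(train_X, train_y), (val_X, val_y) = chronological_split(X, y, train_fraction=0.8)
```

Keeping the split chronological prevents later measurements from leaking into the training subset, which matters when the model is meant to predict the next movement from past ones.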
The rest of this paper is organized as follows. Section 2 describes the experimental setup, including the XXY stage, vision system, and controller; Section 3 introduces the LSTM network used in this study; Section 4 provides details on the experimental process; Section 5 describes the analysis of the experimental results; and Section 6 provides the conclusions of this study.
2. Image Capture System
An XXY stage is characterized by three motors on the same plane and has the merit of a low center of gravity. Thus, the movement speed of an XXY stage is higher than that of a traditional stacked XYθ stage. The XXY stage is also small and light, and its main advantage is that it exhibits a smaller cumulative error in stage composition than does a traditional stacked stage. Consequently, coplanar XXY stages are highly popular for applications that require precision motion, such as AOI and lithography.
As illustrated in Figure 1, the experimental system consisted of an upper mask chip device, which carried the upper cross mask for CCD imaging; a lower coplanar XXY stage (XXY-25-7, CHIUAN YAN Ltd., Changhua, Taiwan) [18], which carried the lower part to align the upper device with the image servo control; and two CCD camera lenses, which were mounted on top of the system as the image servo sensors for positioning. A motion card (PCI-8143, ADLINK TECHNOLOGY Inc., Taoyuan, Taiwan) controlled the XXY stage [19], and ADLINK’s Domino Alpha2 image card provided the XXY stage with image position feedback.
Traditional XYθ stages use a stacked design, which consists of an x-axis translation stage, a y-axis translation stage, and a θ-axis rotational stage. The controller design for a traditional stacked XYθ stage is simple because each axis moves independently. However, the XYθ stage produces cumulative flatness errors because of its stacked assembly and large size. Therefore, a coplanar XXY stage was developed because the coplanar design produces relatively low cumulative errors and can move faster than does the traditional XYθ stage.
Figure 2 displays the structure of the coplanar XXY stage, which is driven by three servo motors: an x1-axis motor, an x2-axis motor, and a y-axis motor. The working stage is supported by four substages, each of which consists of x-axis translation, y-axis translation, and θ-axis rotation stages. Therefore, the motion of the XXY stage has three degrees of freedom: translation along the x-axis and y-axis and rotation around the θ-axis. Figure 3 illustrates the y-axis movement. Table 1 provides the specifications for the XXY stage used in this study (CHIUAN YAN Ltd.) [18], and Table 2 presents the specifications of the CCD device.
This study used OpenCV version 4.5 (Intel Corporation, Mountain View, CA, USA) for image processing. An image of a cross mark captured by the CCD camera was processed to obtain the stage position. The dimensions of the cross mark were 1 mm × 1 mm. Image processing involved grayscaling, binarization, and erosion and dilation to eliminate noise. Finally, the image center-of-gravity method was used to obtain the image coordinates, and the center-of-gravity coordinates were used to calculate the image displacement.
When the stage was moving, the CCD captured an image and saved it as an image file, and the execution time of the CCD was 0.5 s. Each pixel of the image contained three colors: red, green, and blue. After the image was grayscaled, each pixel was represented by a single gray level between 0 (black) and 255 (white). The gray level was lower for dark colors and higher for light colors. Subsequently, binarization was used to convert each pixel value to 0 (black) or 255 (white). The conversion depended on a set threshold, which was 100 in this study. If the pixel value was above the threshold, it was converted to 255. If the pixel value was below the threshold, it was converted to 0. After binarization, the image had only two colors, black and white, which facilitated image processing (Figure 4).
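The grayscaling and binarization steps can be sketched in plain Python as follows. This is an illustration under stated assumptions rather than the code used in the study: the luma weights (0.299, 0.587, 0.114) are a common convention that the text does not specify, pixels exactly equal to the threshold are treated as black, and the function names and sample image are hypothetical.

```python
from typing import List, Tuple

Pixel = Tuple[int, int, int]  # (red, green, blue), each in 0..255


def to_gray(image: List[List[Pixel]]) -> List[List[int]]:
    """Convert an RGB image to gray levels (assumed luma weights, not from the paper)."""
    return [[round(0.299 * r + 0.587 * g + 0.114 * b) for (r, g, b) in row]
            for row in image]


def binarize(gray: List[List[int]], threshold: int = 100) -> List[List[int]]:
    """Map gray levels above the threshold to 255 (white) and all others to 0 (black)."""
    return [[255 if value > threshold else 0 for value in row] for row in gray]


# Hypothetical 2 x 2 image, for illustration only.
rgb = [[(10, 10, 10), (200, 200, 200)],
       [(250, 250, 250), (90, 100, 110)]]
binary = binarize(to_gray(rgb), threshold=100)
```

With the threshold of 100 stated above, the two bright pixels become white and the two dark pixels become black.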
After the image was binarized, the image had stray white and black dots because of the influence of light and the threshold settings. Erosion processing eliminated the small white dots (Figure 5), and dilation processing eliminated the small black dots (Figure 6). Erosion processing resulted in the conversion of the white pixels around a black pixel into black pixels, and dilation processing resulted in the opposite effect. The aforementioned image processing methods can distort the original image if the range of erosion and dilation is excessively large; therefore, a dilation and erosion range of three pixels was set.
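The effect of erosion and dilation on stray dots can be sketched as follows. This is a minimal pure-Python illustration, assuming that the three-pixel range corresponds to a 3 × 3 square window and that positions outside the image are ignored; the function names and test images are hypothetical.

```python
from typing import List

Image = List[List[int]]  # binary image with values 0 (black) or 255 (white)


def _neighborhood(image: Image, row: int, col: int, size: int) -> List[int]:
    """Collect the pixels in a size x size window, skipping positions outside the image."""
    half = size // 2
    values = []
    for r in range(row - half, row + half + 1):
        for c in range(col - half, col + half + 1):
            if 0 <= r < len(image) and 0 <= c < len(image[0]):
                values.append(image[r][c])
    return values


def erode(image: Image, size: int = 3) -> Image:
    """A pixel stays white only if every pixel in its window is white."""
    return [[min(_neighborhood(image, r, c, size)) for c in range(len(image[0]))]
            for r in range(len(image))]


def dilate(image: Image, size: int = 3) -> Image:
    """A pixel becomes white if any pixel in its window is white."""
    return [[max(_neighborhood(image, r, c, size)) for c in range(len(image[0]))]
            for r in range(len(image))]


# A stray white dot on a black background, and a stray black dot on a white background.
speck = [[0] * 5 for _ in range(5)]
speck[2][2] = 255
hole = [[255] * 5 for _ in range(5)]
hole[2][2] = 0

despeckled = erode(speck)  # the white dot disappears
filled = dilate(hole)      # the black dot disappears
```

A larger window removes larger dots but also shrinks or thickens the cross mark itself, which is the distortion that the small range is meant to limit.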
A reference point is required in image positioning. In this study, the reference point was the cross mark. The image center-of-gravity (centroid) method, which is a simple way to find image reference coordinates, was used to find the fiducial coordinates of the cross mark, as shown in Figure 7. In this method, the position coordinates of the pixels of the reference point are summed, and the sums are divided by the total number of pixels of the reference point to obtain the coordinates of the center of gravity. Subsequently, the stage displacement can be calculated using these coordinates.
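$$x_c = \frac{1}{N}\sum_{i=1}^{N} x_i \tag{1}$$

$$y_c = \frac{1}{N}\sum_{i=1}^{N} y_i \tag{2}$$

In (1) and (2), $x_c$ is the x-coordinate of the center of gravity, $y_c$ is the y-coordinate of the center of gravity, $\sum x_i$ is the sum of the x-coordinates of the white pixels of the cross image, $\sum y_i$ is the sum of the y-coordinates of the white pixels of the cross image, and $N$ is the number of white pixels in the cross image.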
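The center-of-gravity calculation and the resulting displacement can be sketched as follows. This is an illustration only: the function names, the synthetic cross images, and the use of pixel units (with no pixel-to-micrometer calibration) are assumptions made for the example.

```python
from typing import List, Tuple

Image = List[List[int]]  # binary image with values 0 (black) or 255 (white)


def centroid(binary: Image) -> Tuple[float, float]:
    """Return (x_c, y_c): the mean column and mean row index of the white pixels."""
    sum_x = sum_y = count = 0
    for y, row in enumerate(binary):
        for x, value in enumerate(row):
            if value == 255:
                sum_x += x
                sum_y += y
                count += 1
    if count == 0:
        raise ValueError("image contains no white pixels")
    return sum_x / count, sum_y / count


def make_cross(width: int, height: int, cx: int, cy: int, arm: int = 2) -> Image:
    """Draw a hypothetical white cross mark centered at (cx, cy) on a black image."""
    image = [[0] * width for _ in range(height)]
    for k in range(-arm, arm + 1):
        image[cy][cx + k] = 255  # horizontal bar
        image[cy + k][cx] = 255  # vertical bar
    return image


before = make_cross(7, 7, cx=3, cy=3)
after = make_cross(7, 7, cx=4, cy=3)  # the mark moved one pixel along x

x0, y0 = centroid(before)
x1, y1 = centroid(after)
shift = (x1 - x0, y1 - y0)  # image displacement in pixels
```

Because the centroid averages over all white pixels, it can resolve positions between pixel centers, although the result remains sensitive to any noise pixels that survive the earlier filtering steps.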
3. LSTM Network
LSTM networks are variants of RNNs [20] and inherit the characteristics of RNN models. In an RNN model, new memory overwrites old memory in the recurrent layer, and the flow of memory cannot be controlled individually; LSTM networks, by contrast, control the flow of memory through gates. As displayed in Figure 8, $x_t$ represents the input data of the RNN at the current time point, and $h_{t-1}$ represents the output of the hidden layer at the previous time point. As the time sequence grows longer, the neural network becomes unable to learn the information at the beginning of the sequence. This problem is called gradient vanishing or gradient exploding: the gradients used to adjust the weights of the RNN model become excessively small or large. Therefore, training on long sequences and obtaining model predictions are difficult tasks for an RNN model.
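The vanishing and exploding behavior can be illustrated numerically with a toy scalar RNN. This sketch is not the network used in this study: the recurrence $h_t = \tanh(w h_{t-1} + u x_t)$, the weight values, and the all-zero input sequence are assumptions chosen so that the effect is easy to see.

```python
import math
from typing import Sequence


def gradient_through_time(w: float, inputs: Sequence[float],
                          h0: float = 0.0, u: float = 1.0) -> float:
    """Return d h_T / d h_0 for the scalar RNN h_t = tanh(w * h_{t-1} + u * x_t)."""
    h = h0
    grad = 1.0
    for x in inputs:
        h = math.tanh(w * h + u * x)
        grad *= w * (1.0 - h * h)  # chain rule factor d h_t / d h_{t-1}
    return grad


steps = [0.0] * 50  # 50 time steps with zero input, so each factor equals w
vanishing = gradient_through_time(0.5, steps)  # 0.5 ** 50, effectively zero
exploding = gradient_through_time(1.5, steps)  # 1.5 ** 50, very large
```

The gradient with respect to the earliest state is a product of one factor per time step, so factors consistently below or above 1 in magnitude drive it toward zero or toward very large values as the sequence lengthens.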
An LSTM network possesses a memory structure that contains memory cells. It adds and memorizes information as a time series progresses, thereby alleviating the vanishing gradient problem. Figure 9 illustrates the basic structure of an LSTM network. The cell state can be used to store and transmit memory; therefore, the information in this state can be written or deleted. Without external influences, the aforementioned information remains unchanged. The parameter $x_t$ represents the input data at time $t$, and $h_t$ is the hidden state at time $t$. The cell state at time $t$ is denoted as $c_t$, which is modified from the cell state $c_{t-1}$ in the hidden layer at time $t-1$.
The hidden layer of an LSTM network contains an input node ($g_t$) and three gates ($f_t$, $i_t$, and $o_t$). The variables $g_t$, $f_t$, $i_t$, and $o_t$ are calculated using (3)–(6), respectively. The input node $g_t$ is used for updating the cell state, and the gates are used to determine whether to allow information to pass through. The three gates in an LSTM network are a forget gate, an input gate, and an output gate. The forget gate ($f_t$) determines which information of the previous cell state ($c_{t-1}$) may pass through. The input gate ($i_t$) determines which information of the input node ($g_t$) may pass through. The vectors (information) passing through the input gate are used for updating the cell state and are subjected to element-wise addition with the vectors (information) passing through the forget gate to generate the cell state ($c_t$). The calculation in the aforementioned process is expressed in (7). The output gate ($o_t$) determines which information of the cell state ($c_t$) may pass through it. The vectors (information) passing through the output gate form the hidden state ($h_t$), and they are the output vectors of the current hidden layer. The calculation method for $h_t$ is presented in (8). In addition, the cell state and hidden state obtained at time $t$, namely $c_t$ and $h_t$, respectively, are transmitted to the hidden layer at time $t+1$. This process, which progresses with a time series, is used for the transmission and learning of memory.
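$$g_t = \tanh(W_g x_t + U_g h_{t-1} + b_g) \tag{3}$$

$$f_t = \sigma(W_f x_t + U_f h_{t-1} + b_f) \tag{4}$$

$$i_t = \sigma(W_i x_t + U_i h_{t-1} + b_i) \tag{5}$$

$$o_t = \sigma(W_o x_t + U_o h_{t-1} + b_o) \tag{6}$$

$$c_t = f_t \otimes c_{t-1} \oplus i_t \otimes g_t \tag{7}$$

$$h_t = o_t \otimes \tanh(c_t) \tag{8}$$

where $W$ and $U$ represent the weights, $b$ denotes the bias, $\oplus$ is the symbol for element-wise addition, $\otimes$ is the symbol for element-wise multiplication, $\tanh$ denotes the hyperbolic tangent, and $\sigma$ denotes the sigmoid function. The functions $\tanh$ and $\sigma$ serve as activation functions.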
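One time step of the hidden-layer computation described above can be sketched in plain Python. This is a minimal illustration of a standard LSTM cell, not the trained model of this study: the symbol names `g`, `f`, `i`, `o`, the `params` dictionary layout, and the all-zero weights in the example are assumptions.

```python
import math
from typing import Dict, List, Tuple

Vector = List[float]
Matrix = List[List[float]]


def _sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def _affine(W: Matrix, x: Vector, U: Matrix, h: Vector, b: Vector) -> Vector:
    """Compute W x + U h + b, one output element per row of W and U."""
    return [sum(w * xi for w, xi in zip(W_row, x))
            + sum(u * hi for u, hi in zip(U_row, h))
            + bias
            for W_row, U_row, bias in zip(W, U, b)]


def lstm_step(x: Vector, h_prev: Vector, c_prev: Vector,
              params: Dict[str, Tuple[Matrix, Matrix, Vector]]) -> Tuple[Vector, Vector]:
    """Advance the cell by one time step; params maps 'g', 'f', 'i', 'o' to (W, U, b)."""
    pre = {name: _affine(W, x, U, h_prev, b) for name, (W, U, b) in params.items()}
    g = [math.tanh(z) for z in pre["g"]]  # input node
    f = [_sigmoid(z) for z in pre["f"]]   # forget gate
    i = [_sigmoid(z) for z in pre["i"]]   # input gate
    o = [_sigmoid(z) for z in pre["o"]]   # output gate
    # New cell state: gated old memory plus gated candidate information.
    c = [ft * cp + it * gt for ft, cp, it, gt in zip(f, c_prev, i, g)]
    # New hidden state: gated, squashed cell state.
    h = [ot * math.tanh(ct) for ot, ct in zip(o, c)]
    return h, c


# One input feature and one hidden unit, with all weights and biases set to zero.
zero = ([[0.0]], [[0.0]], [0.0])
params = {"g": zero, "f": zero, "i": zero, "o": zero}
h, c = lstm_step(x=[1.0], h_prev=[0.0], c_prev=[1.0], params=params)
```

With zero weights every gate evaluates to 0.5 and the input node to 0, so the cell state is simply halved; this makes the roles of the forget and output gates easy to check by hand.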
6. Conclusions
In this study, an LSTM model was used to construct an XXY stage positioning prediction system. By using the XXY stage’s image feedback compensation system, the positioning error was reduced to one pixel. However, the image compensation was limited by the imaging equipment used. Therefore, a dial indicator and an LSTM network were used to construct a positioning prediction model. The maximum positioning error was 7.717 μm, the average error was 2.085 μm, and the root-mean-square error was 2.681 μm. The LSTM network exhibited favorable repeatability. Moreover, relatively small positive and negative errors were observed when using the LSTM network, and the trend of the neural network prediction error curve was similar to that of the actual error curve. The experimental results indicated that an LSTM model can be used for various types of motion positioning prediction, such as forward, backward, and return motion prediction, for an XXY stage. An error was observed between the actual displacement of the stage and the feedback displacement of the encoder. Therefore, an LSTM model with a time-series relationship was established using actual movement information collected by the dial indicator. This model can be used for displacement compensation in control systems for XXY stages.
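The three error statistics reported above can be computed from paired measurements as sketched below. This is an illustration only: the sample values are hypothetical and are not the experimental data, and treating the average error as the mean of the absolute errors is an assumption.

```python
import math
from typing import Dict, Sequence


def positioning_errors(actual: Sequence[float], predicted: Sequence[float]) -> Dict[str, float]:
    """Return the maximum absolute error, mean absolute error, and root-mean-square error."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    errors = [p - a for a, p in zip(actual, predicted)]
    abs_errors = [abs(e) for e in errors]
    return {
        "max_abs": max(abs_errors),
        "mean_abs": sum(abs_errors) / len(abs_errors),
        "rmse": math.sqrt(sum(e * e for e in errors) / len(errors)),
    }


# Hypothetical displacements in micrometers, for illustration only.
actual = [0.0, 10.0, 20.0, 30.0]
predicted = [1.0, 9.0, 22.0, 30.0]
metrics = positioning_errors(actual, predicted)
```

The root-mean-square error weights large deviations more heavily than the mean absolute error does, which is why it is never smaller than the mean absolute error.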