1. Background and Problem Statement
A hard disk drive (HDD) [1,2,3,4] is a storage device used to store digital data in a computer. The two crucial components of an HDD are the media disk and the slider. The slider is responsible for reading data from and writing data to the media disk. To ensure the quality of the slider, the slider dynamic electrical test (SDET) process is conducted using the Amber testing machine in conjunction with an HDD simulator component (referred to as the component).
In the event of a component failure during testing, the component will be ejected from the Amber testing machine with an error code and sent for repair. After the repair, the component will be returned to the Amber testing machine for re-testing. However, the repaired component may encounter a looping problem, characterized by being ejected with the same error within six hours of operation. This issue indicates that the component repeatedly fails with the same error even after being repaired, presenting a significant challenge to maintaining the efficiency and reliability of the testing process.
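The looping criterion above can be stated operationally. The following is a minimal sketch, assuming that the six-hour window is measured from the moment the repaired component is returned to the testing machine; the function name `is_looping` and its arguments are hypothetical and are not taken from the production system.

```python
from datetime import datetime, timedelta

# Window from the problem statement: same error within six hours of operation.
LOOP_WINDOW = timedelta(hours=6)


def is_looping(prev_error, new_error, returned_at, ejected_at, window=LOOP_WINDOW):
    """Return True if a repaired component is ejected again with the same
    error code within `window` of being returned to the tester.

    Assumption: the window starts when the component re-enters the machine.
    """
    same_error = prev_error == new_error
    elapsed = ejected_at - returned_at
    within_window = timedelta(0) <= elapsed <= window
    return same_error and within_window


# Hypothetical example: returned at 08:00, ejected again at 13:00 with the same code.
t0 = datetime(2023, 1, 1, 8, 0)
looped = is_looping("A", "A", t0, t0 + timedelta(hours=5))
```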
An HDD [5,6,7,8,9] is a fundamental storage device used in computers to store various types of digital data, including images, documents, videos, and more. The HDD [10,11,12,13,14] consists of several critical components, among which the media disk and the head (commonly referred to as the “Slider”) are paramount. The media disk serves as the physical medium in which digital data are stored, while the slider is responsible for reading and writing data to and from the media disk. Ensuring the quality and reliability of the slider is crucial for the overall performance and longevity of the HDD.
To maintain high standards of quality, the slider undergoes a rigorous testing process known as the slider dynamic electrical test (SDET). This process is essential for identifying and eliminating defective sliders before they proceed to the next stage of the assembly process. The SDET is conducted using a specialized machine, referred to as the Amber testing machine, which operates in conjunction with a simulated HDD device known as the blade. The blade simulates the working environment of the HDD [15,16,17,18,19,20], allowing for accurate testing of the slider’s performance under real-world conditions.
Despite the critical importance of the SDET process, challenges have arisen, particularly in the handling of components that fail during testing. When a blade encounters an issue during testing, it is ejected from the Amber testing machine with an associated error code and sent to a repair station. At the repair station, technicians attempt to rectify the identified issue and then return the blade to the Amber testing machine for re-testing. However, a recurring problem has been observed, in which some blades are repeatedly ejected with the same error code, even after undergoing repairs. This phenomenon, known as “Blade Looping,” presents a significant challenge to the efficiency and reliability of the slider testing process.
From April 2021 to February 2023, the issue of blade looping was recorded in 34% of all cases involving the product A blades shown in Figure 1, with error code A being particularly problematic, accounting for 3.3% of all blade looping occurrences. This recurring issue not only disrupts the testing process but also reduces the overall testing throughput, resulting in fewer sliders being tested than originally anticipated. On average, only 540 out of an expected 4000 sliders were tested due to the impact of blade looping, representing a significant shortfall in testing capacity.
To address the issue of blade looping, several hypotheses have been proposed regarding its underlying causes. The first hypothesis concerns the limited repair time available to technicians. Given the complexity of the blade repair process, which involves multiple intricate procedures, technicians may struggle to complete repairs effectively within the allocated time. The second hypothesis focuses on the complexity of the repair steps themselves, suggesting that the numerous procedures required during the repair process may be contributing to the recurrence of the same error code. Finally, the third hypothesis highlights the varying levels of experience among technicians, with less experienced technicians potentially lacking the necessary knowledge to prioritize critical areas during the repair process.
Considering these challenges, this study aims to develop a recommendation system designed to enhance the repair process and reduce the incidence of blade looping. The proposed recommendation system will provide technicians with actionable guidance tailored to address the specific issues associated with error code A in Product A blades. By leveraging insights from historical data and incorporating machine learning techniques, the system will help technicians perform repairs more accurately and efficiently, ultimately improving the overall quality and reliability of the slider testing process.
The remainder of this paper is organized as follows: Section 2 provides a theoretical framework and a review of related works, including definitions of recommendation systems, an overview of artificial neural networks, and a discussion of relevant evaluation methods. Section 3 outlines the methodology employed in the development of the recommendation system, detailing the data collection process, the model selection, and the implementation strategy. Section 4 presents the experimental results, including an analysis of the system’s performance and its impact on reducing blade looping occurrences. Finally, Section 5 concludes this paper by summarizing the key findings and offering suggestions for future research.
4. Results
The results for each part of this project are discussed here.
4.1. Recommendation System for Component Members
This project created nine models (from Table 4) with different parameters; the results are shown in Table 6. The best model was obtained with bottleneck = 20 and number of epochs = 300, giving MAE = 0.8434, MSE = 3.3293, and RMSE = 1.8246. The corresponding results are also shown in Appendix A (Figure A1, Figure A2 and Figure A3).
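For reference, the three error measures reported above are related as follows. This is a minimal sketch using the standard definitions of MAE, MSE, and RMSE; the function name `regression_errors` and the toy values are illustrative and are not the data used in this study.

```python
import math


def regression_errors(y_true, y_pred):
    """Return (MAE, MSE, RMSE) for two equal-length sequences of numbers."""
    if len(y_true) == 0 or len(y_true) != len(y_pred):
        raise ValueError("inputs must be non-empty and of equal length")
    residuals = [p - t for t, p in zip(y_true, y_pred)]
    n = len(residuals)
    mae = sum(abs(r) for r in residuals) / n   # mean absolute error
    mse = sum(r * r for r in residuals) / n    # mean squared error
    rmse = math.sqrt(mse)                      # root mean squared error
    return mae, mse, rmse


# Hypothetical toy ratings (observed vs. reconstructed).
mae, mse, rmse = regression_errors([1, 0, 2], [1, 1, 0])

# Consistency check on the reported values: RMSE is the square root of MSE.
reported_rmse_from_mse = math.sqrt(3.3293)
```

Because RMSE is the square root of MSE, the reported MSE of 3.3293 and RMSE of 1.8246 are mutually consistent.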
4.2. Recommendation System for a New Component
We set the top action, TYPE B, as a reference and calculated the cosine similarity scores for other actions. We found that TYPE C had the highest similarity score, followed by TYPE D and TYPE A, respectively.
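The ranking step can be sketched as follows. This is an illustrative example only: the action vectors are hypothetical toy data (one entry per component, 1 if the action was applied to it), chosen for demonstration, and the helper names `cosine_similarity` and `rank_by_similarity` are not taken from the original implementation.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_by_similarity(reference, candidates):
    """Sort candidate actions by similarity to the reference action, highest first."""
    scores = {name: cosine_similarity(reference, vec) for name, vec in candidates.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical implicit-rating vectors over four components.
actions = {
    "TYPE A": [0, 0, 1, 1],
    "TYPE B": [1, 1, 0, 1],
    "TYPE C": [1, 1, 0, 0],
    "TYPE D": [1, 0, 1, 1],
}
reference = actions["TYPE B"]  # the top action serves as the reference
others = {name: vec for name, vec in actions.items() if name != "TYPE B"}
ranking = rank_by_similarity(reference, others)
```

For a new component with no repair history, the actions would then be suggested in the order given by `ranking`, starting from the reference action.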
4.3. Automatic System
To improve the model’s performance, we included an automatic training facility that activates whenever new data are received. As the data volume increases, performance is expected to improve. Using only the first-period data to train the autoencoder resulted in a mean absolute error (MAE) of 0.9301. After appending second-period data, the MAE decreased to 0.8434. This reduction in MAE indicates that an increase in the data volume leads to better performance.
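One way such a retraining trigger could be organized is sketched below. This is an assumption about the structure, not the implementation used in this project: the class `AutoRetrainer` and the callable `train_fn` (which stands in for autoencoder training and returns the resulting MAE) are hypothetical.

```python
from typing import Callable, List, Sequence


class AutoRetrainer:
    """Accumulate repair records and retrain whenever a new batch arrives."""

    def __init__(self, train_fn: Callable[[Sequence], float]):
        self.train_fn = train_fn            # placeholder for model training; returns MAE
        self.data: List = []                # all records received so far
        self.mae_history: List[float] = []  # MAE after each retraining

    def on_new_data(self, batch: Sequence) -> float:
        """Append the batch, retrain on the full data set, and record the MAE."""
        self.data.extend(batch)
        mae = self.train_fn(self.data)
        self.mae_history.append(mae)
        return mae


def relative_improvement(old_mae: float, new_mae: float) -> float:
    """Relative MAE reduction in percent."""
    return 100.0 * (old_mae - new_mae) / old_mae


# Reported values: first-period data only vs. first plus second period.
improvement = relative_improvement(0.9301, 0.8434)
```

With the MAE values reported above, the relative reduction is approximately 9.3%.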
4.4. Applying the Recommendation System
From simulated scenarios, we conducted experiments and observed that the system operated correctly and provided comprehensive recommendations across all situations. These findings are summarized in Table 7.
4.5. Validation
The recommendation system selected 102 components to be rerun in the Amber testing machine, while 104 components remained fixed. We discovered that 13 components failed during the ET process for unrecoverable reasons. Additionally, of the 21 components that experienced the looping issue, only 5 were repaired again using the recommendation system; of these, two no longer experienced the looping problem, whereas three did.
It can be concluded that Product A components experienced the looping problem with error code A at a rate of 20.59%. This is a reduction of 13.41 percentage points compared with the looping rate of 34% reported in the original problem statement. The technicians followed specific recommendations and actions regarding the five components in which the looping problem persisted.
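The arithmetic behind these figures can be reproduced as follows. This is a minimal sketch, assuming that the looping rate is computed as the 21 looping components divided by the 102 components rerun in the Amber testing machine (this ratio matches the reported 20.59%); the function name `looping_rate` is illustrative.

```python
def looping_rate(n_looping: int, n_rerun: int) -> float:
    """Looping rate in percent."""
    return 100.0 * n_looping / n_rerun


baseline_rate = 34.0                  # looping rate from the problem statement (%)
new_rate = looping_rate(21, 102)      # assumed denominator: components rerun

absolute_reduction = baseline_rate - new_rate                      # percentage points
relative_reduction = 100.0 * absolute_reduction / baseline_rate    # percent of baseline

print(f"new rate: {new_rate:.2f}%")
print(f"absolute reduction: {absolute_reduction:.2f} percentage points")
print(f"relative reduction: {relative_reduction:.1f}%")
```

Under this assumption, the absolute reduction of 13.41 percentage points corresponds to a relative reduction of roughly 39% of the baseline looping rate.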
Table 8 shows that, for component 8864, no action was performed the second time around, yet the looping problem did not recur. For component 9512, a different action was performed the second time around, and the looping problem also did not recur. Finally, component 7477 was repaired three times consecutively but still experienced the looping problem each time despite being checked in all positions.
5. Conclusions
The objective of this project was to develop a recommendation system aimed at reducing the incidence of looping issues, defined as a component failing with the same fault within 6 h, by addressing two scenarios: existing component members and new components. For existing components in the FA list, a user-based collaborative filtering approach with implicit ratings was employed, while new components were managed using an item-based collaborative filtering technique with cosine similarity scores. Among the nine user-based filtering models trained with varying parameters, the optimal model was obtained with a bottleneck size of 20 and 300 training epochs, yielding a mean absolute error of 0.8434 and a root mean square error of 1.8246. The system provided an action list recommendation for new components, including options such as TYPE A, TYPE B, TYPE C, and TYPE D. Upon deployment in a manufacturing line, the system reduced the looping rate of Product A components with fault code A from 34% to 20.59%, a reduction of 13.41 percentage points. However, among the components that continued to face looping issues after repair, two did not loop after a second repair, while three continued to loop despite thorough examination in all positions. These findings suggest that, while the system effectively reduces looping issues, further investigation is necessary to identify the root cause, potentially involving the interaction between the existing system and the Amber testing machine. Future research should focus on applying supervised machine learning techniques with expanded data collection and root cause analysis to ascertain the direct origin of component failures.
Suggestion
This project is focused on developing a recommendation system, rather than a predictive model. Consequently, the recommended actions may not always be accurate or comprehensive. Additionally, the dataset used to train the recommendation system is relatively small, which limits the system’s effectiveness. To enhance efficiency, it is necessary to collect more data. As the dataset grows, the system’s performance is expected to improve. Furthermore, incorporating additional error codes into the system would better support the operations of technicians.
Regarding the looping blade issue, it is crucial to identify the exact root cause. Understanding the specific problem allows for the exploration of various alternative solutions. The recommended actions should ideally be derived from predictions based on supervised machine learning. By identifying signals that clearly indicate potential failures at specific positions, the system could provide technicians with precise locations for repair. However, it is essential to have a thorough understanding of the data or to collect data directly related to the blades, preferably through automated sensors rather than through manual input.