5.1. Overall Results
Table 5 summarizes the overall results of our endurance testing experiments. All devices were tested from brand-new condition until device failure, and all of them reported their remaining lifetime via the “wear out indicator” attribute of S.M.A.R.T., which we refer to as the default indicator. We highlight four major findings as follows:
Finding #1: The actual lifetime of all devices was much longer than that reported by the default indicators. The last two columns of Table 5 compare the lifetime reported by the default indicators with the actual lifetime, both expressed as a number of workload iterations. The actual lifetime far exceeded the indicated lifetime in every case. For example, the indicated lifetime of C2 was 3396 workload iterations, while its actual lifetime was 7013 iterations, or 207% of the indicated value. On average, the actual lifetime was 171% of the indicated lifetime, which implies that the default indicators were generally very conservative and that relying on them may waste a significant portion of the usable device lifespan.
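The ratio quoted for C2 can be checked with a short calculation. The following is a minimal sketch; the function name is hypothetical, and only the two C2 values given in the text are used.

```python
def lifetime_ratio_percent(indicated_iters: int, actual_iters: int) -> float:
    """Return the actual lifetime as a percentage of the indicated lifetime.

    Both arguments are counts of workload iterations.
    """
    if indicated_iters <= 0:
        raise ValueError("indicated lifetime must be positive")
    return 100.0 * actual_iters / indicated_iters


# C2 values taken from the text: indicated 3396, actual 7013 iterations.
c2_ratio = lifetime_ratio_percent(3396, 7013)
print(f"C2: {c2_ratio:.0f}% of indicated lifetime")  # prints: C2: 207% of indicated lifetime
```

The same function can be applied to each row of Table 5 to obtain the per-device ratios behind the reported average.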
Finding #2: Some indicators may report misleading values. During the workload, all indicators reported the lifetime as a decreasing positive value, which gradually reached zero. However, after reporting the zero lifetime, the indicators of two devices (i.e., C1, C2) overflowed to a garbage value, which could mislead users about the device lifetime.
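One way a monitoring tool could guard against such overflowed readings is to latch the indicator at zero once it has been reached. The sketch below is illustrative only: it assumes the readings are available as a list of integers, one per workload iteration, and the function name is hypothetical. It does not handle an indicator that wraps around without ever reporting zero.

```python
def sanitize_indicator(readings):
    """Clamp lifetime-indicator readings to zero once zero has been reached.

    Assumption: `readings` is a sequence of integers that normally decreases
    toward zero. Any value reported after the first zero (or non-positive)
    reading is treated as garbage and replaced by zero.
    """
    cleaned = []
    exhausted = False
    for value in readings:
        if exhausted or value <= 0:
            exhausted = True  # latch: never trust later readings
            cleaned.append(0)
        else:
            cleaned.append(value)
    return cleaned


# Hypothetical trace: the indicator reaches zero and then wraps to 255.
print(sanitize_indicator([3, 2, 1, 0, 255, 254]))  # prints: [3, 2, 1, 0, 0, 0]
```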
Finding #3: When devices eventually failed, they failed in different manners and caused different amounts of data loss. As shown in the “Fail” column, four devices (i.e., A1, A2, B2, D2) ended with Device Not Found (DNF), three devices (i.e., B1, C1, C2) corrupted user data (DC), and one device (D1) became Read Only (RO). Even devices from the same manufacturer (e.g., B1 and B2) behaved differently (e.g., DC versus DNF). Among all the devices, only D1 failed gracefully without causing any data loss.
Finding #4: The internal status of the devices varied greatly. The fourth through ninth columns (i.e., “PEr” to “Jitter”) show the final values of the parameters monitored by our framework, which reveal the internal status of the devices. These values were not consistent even among devices from the same manufacturer (e.g., B1’s PEr was 207, while B2’s PEr was 46). Moreover, some devices (e.g., D1 and D2) did not report all parameters. We analyze the trends of these parameters in more detail in the next section.
5.2. Analysis of Individual Parameters
To understand the device behavior in more detail, we analyzed the trend of each parameter over the whole lifetime of the devices. In this section, we summarize our findings on PEr, EEr, CEr, and performance jitter, as these are the most prevalent among the devices.
Finding #5: Program and erase errors were prevalent, and tended to occur in batches towards the end of device lifetime. Program and erase errors were observed in all devices except for D1 and D2, which did not disclose the related attributes. Moreover, we found that the errors tended to occur in batches when the devices were reaching the end of their lifetime. Two examples are shown in Figure 4, which reveals the trend of PEr and EEr during the whole lifetime of A1 and A2. We can see that immediately after the 3188th workload iteration, the PEr of A1 suddenly increased from zero to more than 20. We define this behavior as surging, where a batch of errors occurs in one or more consecutive workload iterations.
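The notion of surging can be made concrete with a small detector over a cumulative error counter. This is a sketch under stated assumptions, not the procedure used in the experiments: the input is assumed to be a list of cumulative error counts sampled once per workload iteration, and the `min_jump` threshold is a hypothetical parameter that would need tuning per device.

```python
def find_surges(cumulative_errors, min_jump=20):
    """Return (start, end) index pairs of surging periods.

    A surging period is a run of one or more consecutive iterations in which
    the cumulative error count grows by at least `min_jump` per iteration.
    `min_jump` is an illustrative threshold, not a value from the study.
    """
    surges = []
    start = None
    for i in range(1, len(cumulative_errors)):
        jump = cumulative_errors[i] - cumulative_errors[i - 1]
        if jump >= min_jump:
            if start is None:
                start = i  # first iteration of a new surge
        elif start is not None:
            surges.append((start, i - 1))  # surge ended at the previous iteration
            start = None
    if start is not None:
        surges.append((start, len(cumulative_errors) - 1))
    return surges


# Hypothetical trace: quiet start, a two-iteration surge, then a one-iteration surge.
print(find_surges([0, 0, 0, 25, 60, 60, 61, 100]))  # prints: [(3, 4), (7, 7)]
```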
Several reasons may contribute to this behavior. As mentioned in Section 2, flash cells wear out gradually as more electrons are “stuck” in the oxide layer. Therefore, program and erase errors are unlikely to happen at the early stage of the device lifetime. Furthermore, for performance reasons, the FTL may apply wear leveling and other algorithms to groups of blocks, which means the blocks within the same group (and the pages within the same block) tend to have a similar usage rate. As a result, when an erase or program error occurs, a retry on the neighboring block or page is likely to generate a similar error.
Finding #6: Correctable errors were the most common, and they exhibited both similarities and differences among the devices. All drives reported correctable errors, and we observed a large number of CErs on every device. Figure 5A–D shows the trend of CEr on the eight devices. On the one hand, the CEr increased slowly in the early stage of all devices. For example, C1 and C2 in Figure 5C had less than 10% of their errors in the first 40% of their lifetime. Furthermore, all devices exhibited surging, where errors appeared in batches.
On the other hand, the timing, as well as the number of errors in the surging period, may differ even between devices with the same capacity from the same manufacturer. For example, in Figure 5C, C2’s surging period started at about 40% of its lifetime and reached about 0.7 at around 50% of its lifetime, while C1’s first surging started after 50% of its lifetime and reached less than 0.4 at 60% of its lifetime.
Finding #7: CRC errors were not directly related to device failures. Two drives, A2 and D1, experienced two occurrences of CRC errors, as shown in the CRC column of Table 5 and illustrated with mark C in Figure 5A and Figure 5D, respectively. Both occurred in the early stage of the device lifespan, and no other abnormal status was observed after the CRC errors, which implies that this type of error does not directly contribute to device failures. A likely explanation is that CRC errors are usually caused by unstable communication channels between the host and the device and may be resolved simply by a retry.
Finding #8: Uncorrectable errors always led to device failures. Experiments on C1 and C2 were terminated by the checking procedure of our framework with bit corruptions observed in our records. Immediately before observing the corruptions, there were uncorrectable errors detected, as illustrated with mark U in Figure 5C. Therefore, a non-zero uncorrectable error count is a strong indication of an imminent device failure.
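Findings #7 and #8 together suggest a simple triage rule for a monitoring script. The sketch below is illustrative only: the function name, its parameters, and the returned labels are hypothetical, and the counters are assumed to have already been read from the device.

```python
def failure_warning(uncorrectable_errors: int, crc_errors: int = 0) -> str:
    """Classify a device based on two error counters.

    - Any uncorrectable error is treated as a sign of imminent failure
      (Finding #8).
    - CRC errors alone are treated as a host-device link issue that a retry
      may resolve, not as a failure predictor (Finding #7).
    The returned labels are illustrative, not part of any standard.
    """
    if uncorrectable_errors > 0:
        return "imminent-failure"
    if crc_errors > 0:
        return "check-link"
    return "ok"


print(failure_warning(uncorrectable_errors=0, crc_errors=2))  # prints: check-link
print(failure_warning(uncorrectable_errors=1, crc_errors=0))  # prints: imminent-failure
```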
Finding #9: Performance may not necessarily decrease when the device reaches the end of its lifetime. As shown in the “Jitter” column of Table 5, we observed different degrees of performance slowdown on four devices (i.e., C1, C2, D1, D2) at the end of their lifetime. The jitters lasted as long as two workload iterations, and the slowdown differed for different write sizes within each workload iteration, as summarized in Table 6. Furthermore, we mark the jitters that occurred on C1, C2, D1, and D2 with arrows in Figure 5C,D. We can see that all jitters occurred after 70% of the device lifetime. One possible reason is that, at the late stage of device lifetime, there are many internal errors, which lead to frequent retries and re-allocations and hurt performance.
On the other hand, half of the eight devices (i.e., A1, A2, B1, B2) did not exhibit any performance slowdown throughout their whole lifespan. This suggests that estimating the lifetime based on performance may not work for some devices.
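For devices that do exhibit jitter, slow iterations could be flagged by comparing each iteration's duration against a baseline. This is a minimal sketch, assuming per-iteration durations are available as a list of numbers; the median baseline and the `slowdown_factor` threshold are hypothetical choices, not the criteria used to build Table 6.

```python
from statistics import median


def find_jitters(durations, slowdown_factor=2.0):
    """Return indices of iterations that ran much slower than the median.

    `durations` holds one duration per workload iteration (any time unit).
    An iteration counts as a jitter if it took more than `slowdown_factor`
    times the median duration. Both choices are illustrative assumptions.
    """
    if not durations:
        return []
    baseline = median(durations)
    return [i for i, d in enumerate(durations) if d > slowdown_factor * baseline]


# Hypothetical trace: two consecutive slow iterations near the end.
print(find_jitters([10, 10, 11, 10, 30, 31, 10]))  # prints: [4, 5]
```

As Finding #9 notes, an empty result from such a detector says little about remaining lifetime, since half of the tested devices never slowed down.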
5.4. Analysis of the External Environment
Finding #11: Higher temperature can impact the drive lifetime under memory-oriented workloads. Temperature is known to influence the ability of SSDs to retain data over the long term [24, 25]. Through the test, we found that higher temperature can also affect the drives’ remaining lifetime under workloads of short-lived data. While we tried our best to simulate the field environment of data centers, the SSDs still experienced temperature variation, which may be due to transient differences in workloads, air flow, and FTL internal thermal throttling. Table 7 shows the distribution of the working temperature of the different test devices. We observe that the SSDs with higher percentages of hot and extreme temperatures (namely A2, B1, C1, and D1) all had shorter lifetimes, as shown in Table 5. This observation is consistent with temperature variation affecting device lifetime. Thus, the external environment (e.g., temperature) should be taken into account when designing a more accurate lifetime indicator.
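A temperature distribution of the kind reported in Table 7 can be derived from periodic temperature samples. The sketch below is illustrative only: the bin boundaries are passed in as parameters because the thresholds behind the hot and extreme categories are not stated here, the “normal” bin name is hypothetical, and the sample values in the example are made up.

```python
from collections import Counter


def temperature_distribution(samples_c, hot_c, extreme_c):
    """Return the fraction of samples in each temperature bin.

    `samples_c` are temperature readings in degrees Celsius. `hot_c` and
    `extreme_c` are caller-supplied bin boundaries (assumptions, not values
    from the study). Readings below `hot_c` fall into a hypothetical
    "normal" bin.
    """
    if not samples_c:
        raise ValueError("at least one temperature sample is required")

    def bucket(temp):
        if temp >= extreme_c:
            return "extreme"
        if temp >= hot_c:
            return "hot"
        return "normal"

    counts = Counter(bucket(t) for t in samples_c)
    total = len(samples_c)
    return {name: counts[name] / total for name in ("normal", "hot", "extreme")}


# Made-up samples and thresholds, for illustration only.
dist = temperature_distribution([30, 35, 50, 50, 70, 30, 30, 30], hot_c=45, extreme_c=65)
print(dist)  # prints: {'normal': 0.625, 'hot': 0.25, 'extreme': 0.125}
```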