Posted on August 31, 2010 with tags tech.
For a couple of weeks now, one of the hard drives in my server has been behaving strangely. The RAID controller first started reporting drive timeouts (daily, when the drive runs a short or long SMART self-test), and now it even kicks the drive out of the (RAID1) array.
And yet, during normal operation, there’s no read error or write error that I can trigger, and there are no reallocated sectors (according to SMART). The only time I can reproduce the error is when both of these are true:
- the drive is running a SMART short or long self-test
- a SMART query for the drive is being done (e.g. smartctl -l selftest, etc.)
What happens when both these conditions are met? The SMART query takes ages (as in ~20-30 seconds). The delay can get so long that the drive itself reports a timeout error (if any I/O takes place at the same time) and logs an error in its internal error log.
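To put numbers on the delay described above, here is a minimal Python sketch that times an arbitrary command. The helper name `time_command` and the device path `/dev/sdX` are placeholders of mine, not anything from the original setup. The smartctl options in the comments (`-t short` to start a self-test, `-l selftest` to read the self-test log) are standard ones, but treat the whole thing as an illustration rather than a tested diagnostic procedure.

```python
import subprocess
import time


def time_command(argv):
    """Run argv, discard its output, and return (elapsed_seconds, exit_code)."""
    start = time.monotonic()
    result = subprocess.run(
        argv, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
    )
    return time.monotonic() - start, result.returncode


# Hypothetical usage (needs root; /dev/sdX is a placeholder device):
#   subprocess.run(["smartctl", "-t", "short", "/dev/sdX"])  # start a self-test
#   elapsed, rc = time_command(["smartctl", "-l", "selftest", "/dev/sdX"])
#   print(f"SMART query took {elapsed:.1f}s (exit code {rc})")
```

Running the query once while a self-test is active and once while the drive is idle should show whether the slowdown only appears when the two conditions coincide.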
For comparison, another hard drive (identical brand) needs ~3s for a SMART query during a self-test and I/O load, with no issues whatsoever. For this hard drive, smartctl -a reads for a while, and then:
    Error SMART Error Self-Test Log Read failed: Input/output error
    Smartctl: SMART Self Test Log Read Failed
    …
    real 0m39.029s
The timeout above has also generated lots of errors in the drive’s error log. I don’t know how to read these properly, but in any case they don’t seem too scary:
    Error 144 occurred at disk power-on lifetime: 13552 hours (564 days + 16 hours)
      When the command that caused the error occurred, the device was doing SMART Offline or Self-test.

      After command completion occurred, registers were:
      ER ST SC SN CL CH DH
      -- -- -- -- -- -- --
      10 51 00 80 ae 39 40

      Commands leading to the command that caused the error were:
      CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
      -- -- -- -- -- -- -- --  ----------------  --------------------
      61 80 70 80 ae 39 1c 08  41d+08:07:45.052  WRITE FPDMA QUEUED
      b0 d0 01 00 4f c2 00 08  41d+08:07:45.038  SMART READ DATA
      ec 00 01 00 00 00 00 08  41d+08:07:44.958  IDENTIFY DEVICE
      2f 00 01 10 00 00 00 08  41d+08:07:44.957  READ LOG EXT
      61 80 70 80 ae 39 1c 08  41d+08:07:37.960  WRITE FPDMA QUEUED
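As a rough aid for reading the ER (error) and ST (status) columns in the log above, the following sketch decodes the two register values into flag names. The bit tables reflect my understanding of the classic ATA register layout. Some bits are obsolete or command-dependent in newer revisions of the standard, so take the names as a hint and not as a definitive interpretation. The function name `decode` is mine.

```python
# Bit names per the classic ATA status/error register layout (an assumption;
# newer ATA revisions mark some of these bits obsolete or command-dependent).
STATUS_BITS = {
    0x80: "BSY",   # device busy
    0x40: "DRDY",  # device ready
    0x20: "DF",    # device fault
    0x10: "DSC",   # seek complete (obsolete in later revisions)
    0x08: "DRQ",   # data request
    0x01: "ERR",   # error, details are in the error register
}
ERROR_BITS = {
    0x80: "ICRC",  # interface CRC error
    0x40: "UNC",   # uncorrectable data
    0x10: "IDNF",  # requested address (ID) not found
    0x04: "ABRT",  # command aborted
}


def decode(value, names):
    """Return the names of all known bits set in value, highest bit first."""
    return [name for bit, name in sorted(names.items(), reverse=True)
            if value & bit]


print("ST 0x51:", decode(0x51, STATUS_BITS))
print("ER 0x10:", decode(0x10, ERROR_BITS))
```

Under these assumptions, ST = 51 reads as "ready, with the error bit set" and ER = 10 as IDNF. Notably, UNC (an unreadable sector) is not set.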
For some of the errors, all preceding commands are WRITE FPDMA QUEUED, but all of them happened during a “SMART Offline or Self-test” phase.
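The Powered_Up_Time column in the error log above also allows a quick look at the spacing between the logged commands. The sketch below copies the five timestamps from that log and computes the gaps in milliseconds. It assumes the `DDd+hh:mm:SS.sss` format that smartctl documents for this column. The helper names `to_ms` and `gaps_ms` are mine.

```python
import re

# (Powered_Up_Time, command) pairs copied from the error log, newest first.
LOG = [
    ("41d+08:07:45.052", "WRITE FPDMA QUEUED"),
    ("41d+08:07:45.038", "SMART READ DATA"),
    ("41d+08:07:44.958", "IDENTIFY DEVICE"),
    ("41d+08:07:44.957", "READ LOG EXT"),
    ("41d+08:07:37.960", "WRITE FPDMA QUEUED"),
]

_STAMP = re.compile(r"(\d+)d\+(\d+):(\d+):(\d+)\.(\d{3})")


def to_ms(stamp):
    """Convert 'DDd+hh:mm:SS.sss' to integer milliseconds since power-on."""
    d, h, m, s, ms = (int(x) for x in _STAMP.fullmatch(stamp).groups())
    return (((d * 24 + h) * 60 + m) * 60 + s) * 1000 + ms


def gaps_ms(entries):
    """Gap in ms between each entry and the next older one (newest first)."""
    times = [to_ms(stamp) for stamp, _ in entries]
    return [newer - older for newer, older in zip(times, times[1:])]


for (stamp, name), gap in zip(LOG, gaps_ms(LOG)):
    print(f"{name:20s} {gap:5d} ms after the previous logged command")
```

For this particular entry, the SMART-related commands arrive within about 100 ms of each other, while roughly 7 seconds separate them from the earlier write. That gap is consistent with a stall, but it could just as well be idle time, so it proves nothing on its own.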
When a self-test is not being done, reading all the SMART data (smartctl -a) is very quick, taking about half a second.
The only thing I can think of is that the drive’s own area for storing SMART data is unhealthy, so reading it takes time, and a concurrent SMART test and I/O load make it hard for the drive to do so. But again, I can’t trigger any real I/O error, neither at the beginning of the drive nor at the end, so…
This also happens when the drive is connected to a plain SATA port, skipping the RAID controller, so it’s not just the controller playing games with me.
I’m really confused now. Given my previous experience, this drive will die, should already have died, and yet, no I/O errors, just some timeouts. Do I just need to wait a couple more weeks?