Spooky harddrive

Posted on August 31, 2010 with tags tech.

For a couple of weeks now, one of the harddrives in my server has been behaving strangely. The RAID controller first started reporting drive timeouts (daily, when the drive runs a short or long SMART test), and now it even kicks the drive out of the (RAID1) array.

And yet, during normal operation, there’s no read or write error that I can trigger, and there are no reallocated sectors (according to SMART). The only way I can reproduce the error is when both of the following hold:

  • the drive is running a SMART short or long self-test
  • a SMART query for the drive is being done (e.g. -c, -l selftest, etc.)

What happens when both these conditions are met? The SMART query takes ages (as in ~20-30 seconds). The delay can be long enough that the drive itself reports a timeout error (if any I/O takes place at the same time) and logs it in its internal error log.
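To put numbers on how long a SMART query takes, a small timing wrapper is enough. This is only a sketch: the helper name time_command and the 120-second timeout are my own choices for illustration, and it simply runs whatever command line it is given and measures wall-clock time.

```python
import subprocess
import time


def time_command(argv, timeout=120.0):
    """Run argv and return (elapsed_seconds, return_code).

    return_code is None if the command did not finish within `timeout`.
    The helper name and the default timeout are illustrative assumptions.
    """
    start = time.monotonic()
    try:
        result = subprocess.run(
            argv,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout,
        )
        code = result.returncode
    except subprocess.TimeoutExpired:
        # The child is killed by subprocess.run before the exception is raised.
        code = None
    return time.monotonic() - start, code
```

For example, time_command(["smartctl", "-l", "selftest", "/dev/sdX"]) (with /dev/sdX standing in for the real device) can be called once while a self-test is running and once while the drive is idle, and the two elapsed times compared.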

On another harddrive (identical brand), a SMART query during a self-test and under I/O load takes ~3s, with no issues whatsoever. On the problematic harddrive, smartctl -a reads for a while, and then:

Error SMART Error Self-Test Log Read failed: Input/output error
Smartctl: SMART Self Test Log Read Failed
real    0m39.029s

The timeouts above have also generated lots of errors in the drive’s error log. I don’t know how to read these properly, but in any case they don’t seem too scary:

Error 144 occurred at disk power-on lifetime: 13552 hours (564 days + 16 hours)
  When the command that caused the error occurred, the device was doing SMART Offline or Self-test.

  After command completion occurred, registers were:
  -- -- -- -- -- -- --
  10 51 00 80 ae 39 40

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  61 80 70 80 ae 39 1c 08  41d+08:07:45.052  WRITE FPDMA QUEUED
  b0 d0 01 00 4f c2 00 08  41d+08:07:45.038  SMART READ DATA
  ec 00 01 00 00 00 00 08  41d+08:07:44.958  IDENTIFY DEVICE
  2f 00 01 10 00 00 00 08  41d+08:07:44.957  READ LOG EXT
  61 80 70 80 ae 39 1c 08  41d+08:07:37.960  WRITE FPDMA QUEUED

For some of the errors, all preceding commands are WRITE FPDMA QUEUED, but all of them occurred during a “SMART Offline or Self-test” phase.
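As a rough aid for reading the register dump in such a log entry, the sketch below decodes one line of it. It rests on assumptions that the log excerpt does not state: that the seven bytes are in the order ER ST SC SN CL CH DH (the usual smartctl summary layout), and that the error and status registers use the conventional ATA bit names, some of which are obsolete in newer ATA revisions. The 28-bit LBA it computes may be truncated or meaningless for NCQ commands such as WRITE FPDMA QUEUED. The function names are made up for this example.

```python
# Conventional ATA bit names (assumed; several are obsolete in newer revisions).
ERROR_BITS = {7: "ICRC", 6: "UNC", 5: "MC", 4: "IDNF",
              3: "MCR", 2: "ABRT", 1: "NM", 0: "AMNF"}
STATUS_BITS = {7: "BSY", 6: "DRDY", 5: "DF", 4: "DSC",
               3: "DRQ", 2: "CORR", 1: "IDX", 0: "ERR"}


def decode_bits(value, names):
    """Return the names of all set bits, highest bit first."""
    return [names[bit] for bit in range(7, -1, -1) if value & (1 << bit)]


def decode_error_line(line):
    """Decode a register line, assumed to be in ER ST SC SN CL CH DH order."""
    er, st, sc, sn, cl, ch, dh = (int(tok, 16) for tok in line.split())
    # 28-bit LBA: low nibble of DH, then CH, CL, SN (most to least significant).
    lba28 = ((dh & 0x0F) << 24) | (ch << 16) | (cl << 8) | sn
    return {
        "error": decode_bits(er, ERROR_BITS),
        "status": decode_bits(st, STATUS_BITS),
        "lba28": lba28,
    }


print(decode_error_line("10 51 00 80 ae 39 40"))
```

Under these assumptions, the entry above decodes to the IDNF error bit with status DRDY, DSC and ERR set.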

When no self-test is running, reading all the SMART data (smartctl -a) is very quick, taking half a second.

The only thing I can think of is that the drive’s own area for storing SMART data is unhealthy, so reading it takes time, and a concurrent SMART test plus I/O load makes it even harder for the drive to do so. But again, I can’t trigger any real I/O error, neither at the beginning of the drive nor at the end, so…

This also happens when the drive is connected to a plain SATA port, skipping the RAID controller, so it’s not just the controller playing games with me.

I’m really confused now. Given my previous experience, this drive will die, should already have died, and yet, no I/O errors, just some timeouts. Do I just need to wait a couple more weeks?