SSD adventures

Posted on December 1, 2014 with tags tech.

TL;DR: Oh boy. Things work now, but I’m not sure exactly what happened :(

When it all worked fine… almost

In June, I bought a new laptop, and a new SSD for it. I had used that model of SSD before (Samsung 840 Evo), although not for a long time, so I wasn’t expecting anything unusual.

Since the laptop is a slower one, I installed Debian as follows: connect the SSD to my workstation, do an install on it (via Virtualbox connected to the raw device), disconnect it, and install it in the laptop. The first sign of trouble was that the SSD didn’t boot reliably. I said - maybe my Virtualbox method (a new method for me) wasn’t right - so I reinstalled on the laptop, and everything was mostly OK.

Mostly OK, because occasionally the laptop had trouble seeing the SSD: sometimes not cold booting and requiring a restart, or giving some ATA errors on boot (but nothing afterwards). I didn’t use the laptop too much - I mostly use it when travelling - so I didn’t investigate further.

Trouble begins

Fast forward to two weeks ago: I was preparing to leave on a trip, so I booted the laptop (everything OK), ran apt-get dist-upgrade, synced my git trees, etc. Everything was fine. The next day, in the airport between two flights, the laptop doesn’t boot - doesn’t see the SSD at all. I tried rebooting a gazillion times, nothing. I was quite upset - at the hardware, and at myself for not paying attention to the earlier signs of unreliability.

Once I arrived at the destination, I opened the laptop, tried re-seating the SSD, nothing. I bought a SATA-to-USB bridge, and surprise! Boots on the first try, no issues. Diagnosis A: Laptop SATA connector has issues.

I work with this SATA-to-USB bridge for a couple of days, but it is quite slow (~20MB/s), so I buy a SATA-to-USB3 cradle, which should be much faster. But… the SSD was not visible in this cradle. Not only that, but it was causing the laptop to hang at the POST screen - reliably. Turn the cradle off, the laptop passes POST; turn it on, the laptop takes 2 minutes to pass POST. OK, the cradle is broken. Connect the SSD back to the USB2 thing… not booting⁈ For about five minutes, it was like “dead”. After that, it booted and behaved normally. I didn’t know what to think, so I just put it aside. So I worked for the rest of the week on the USB2 bridge, with no issues (once the SSD dropped off, but I think that was just USB being USB). So at the end of the week, diagnosis (A) was still the main contender.

On the flight back home, I worked from the plane for a good number of hours, again no issue. Laptop/SSD were fully powered off before the flight, powered up with no issues, worked fine. At the end of the flight I completely shut down my laptop. Diagnosis A still on top.

Real trouble now…

After getting home and sleeping a bit, I wanted to power up the laptop just to transfer the code I wrote on the plane. But… it didn’t power up. No problem, I said, now I actually have access to running Linux machines and I can check what’s happening. And to my surprise, the SSD was behaving… erratically:

  • part of the time it was not seen at all, the block device was a generic kernel: scsi 21:0:0:0: Direct-Access USB TO I DE/SATA Device 0008 PQ: 0 ANSI: 0

  • part of the time it was seen, i.e. the block device was visible as kernel: scsi 22:0:0:0: Direct-Access Samsunp w 0 EVO 500G 0008 PQ: 0 ANSI: 0, but with lots of ATA errors

  • it took ~10 minutes before I was able to access it at all

  • access to the disk when it was visible and not erroring out was slow; e.g. a cfdisk /dev/sdb could take >5 seconds before showing the screen

The errors looked like this (when connected over SATA):

20:22:57 kernel: ata2.00: ATA-9: Samsung SSD 840 EVO 500GB, EXT0BB6Q, max UDMA/133
20:22:57 kernel: ata2.00: 976773168 sectors, multi 1: LBA48 NCQ (depth 31/32), AA
20:22:57 kernel: ata2.00: configured for UDMA/133
20:22:57 kernel: scsi 3:0:0:0: Direct-Access     ATA      Samsung SSD 840  EXT0 PQ: 0 ANSI: 5
20:22:57 kernel: sd 3:0:0:0: [sdc] 488397168 512-byte logical blocks: (250 GB/232 GiB)
20:22:57 kernel: sd 3:0:0:0: [sdc] Write Protect is off
20:22:57 kernel: sd 3:0:0:0: [sdc] Mode Sense: 00 3a 00 00
20:22:57 kernel: sd 3:0:0:0: [sdc] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
20:22:57 kernel: ata2.00: exception Emask 0x0 SAct 0x2 SErr 0x400001 action 0x6 frozen
20:22:57 kernel: ata2: SError: { RecovData Handshk }
20:22:57 kernel: ata2.00: failed command: READ FPDMA QUEUED
20:22:57 kernel: ata2.00: cmd 60/08:08:68:01:00/00:00:00:00:00/40 tag 1 ncq 4096 in
20:22:57 kernel:         res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
20:22:57 kernel: ata2.00: status: { DRDY }
20:22:57 kernel: ata2: hard resetting link
20:22:57 kernel: ata2: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
20:22:57 kernel: ata2.00: failed to get NCQ Send/Recv Log Emask 0x1
20:22:57 kernel: ata2.00: failed to get NCQ Send/Recv Log Emask 0x1
20:22:57 kernel: ata2.00: configured for UDMA/133
20:22:57 kernel: ata2.00: device reported invalid CHS sector 0
20:22:57 kernel: ata2: EH complete
20:22:57 kernel: ata2.00: exception Emask 0x0 SAct 0x400000 SErr 0x400001 action 0x6
20:22:57 kernel: ata2.00: irq_stat 0x44000008
20:22:57 kernel: ata2: SError: { RecovData Handshk }
20:22:57 kernel: ata2.00: failed command: READ FPDMA QUEUED
20:22:57 kernel: ata2.00: cmd 60/08:b0:08:03:00/00:00:00:00:00/40 tag 22 ncq 4096 in
20:22:57 kernel:         res 41/84:00:08:03:00/00:00:00:00:00/00 Emask 0x410 (ATA bus error) <F>
20:22:57 kernel: ata2.00: status: { DRDY ERR }
20:22:57 kernel: ata2.00: error: { ICRC ABRT }
20:22:57 kernel: ata2: hard resetting link
20:22:57 kernel: ata2: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
20:22:57 kernel: ata2.00: failed to get NCQ Send/Recv Log Emask 0x1
20:22:57 kernel: ata2.00: failed to get NCQ Send/Recv Log Emask 0x1
20:22:57 kernel: ata2.00: configured for UDMA/133
20:22:57 kernel: ata2: EH complete

Note that all of the messages were in short order (during boot, they show the same timestamp but I don’t think it was actually the same second).
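
A log like this is easier to reason about once it is condensed. What follows is only a minimal sketch, not something I ran at the time: it counts the "failed command" lines and the flags inside the "SError" and "error" braces. The function name summarize_ata_errors and the two regular expressions are my own assumptions, written against the line shapes visible in the excerpt above.

```python
import re
from collections import Counter

# Assumed line shapes, taken from the excerpt above:
#   "... ata2.00: failed command: READ FPDMA QUEUED"
#   "... ata2: SError: { RecovData Handshk }"
#   "... ata2.00: error: { ICRC ABRT }"
FAILED_CMD = re.compile(r"(ata\d+(?:\.\d+)?): failed command: (.+)$")
FLAGS = re.compile(r"(ata\d+(?:\.\d+)?): (SError|error): \{ (.+?) \}")

def summarize_ata_errors(lines):
    """Return (failed, flags): counters of failed commands and error flags."""
    failed = Counter()   # keyed by (port, command)
    flags = Counter()    # keyed by (kind, flag), kind is "SError" or "error"
    for line in lines:
        line = line.rstrip()
        m = FAILED_CMD.search(line)
        if m:
            failed[(m.group(1), m.group(2))] += 1
            continue
        m = FLAGS.search(line)
        if m:
            for flag in m.group(3).split():
                flags[(m.group(2), flag)] += 1
    return failed, flags

if __name__ == "__main__":
    import sys
    failed, flags = summarize_ata_errors(sys.stdin)
    for key, count in failed.most_common():
        print(count, *key)
    for key, count in flags.most_common():
        print(count, *key)
```

Feeding it the saved kernel log on standard input prints one line per distinct failed command and per flag, which makes it quicker to see whether the same link-level errors (for example ICRC) keep recurring.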

I tried connecting the SSD over USB2 (partially working), over USB3 (not working!), and directly over SATA (initially not working). I connected another SSD I had around (same model, just smaller capacity) over all three, it worked (so the USB3 bridge was working, at least).

Diagnosis (A) was out the door now, and the situation was very clear: SSD dying/dead/almost gone. So I connected both the broken SSD and an empty SSD to my workstation over SATA, and started copying data (once I managed to boot with the old one being visible and working).

During the data copy, I saw that the “broken” SSD did indeed behave erratically: data was coming off it at either ~40MB/s, ~70MB/s, or ~160MB/s. No other speeds, at least not for long, just cycling between these three; and this is very slow for this SSD model. And then I remembered that there is, for this model (Samsung 840 Evo), an advisory and firmware fix for an issue where old data gets harder to access (slower and slower), due to how TLC cell levels are read. I don’t know exactly what “old” means, but since the partition table was written only once, it should be the oldest thing written, which would make it the most susceptible to the slowdown, and could explain the cfdisk slowness. So after the data copy, I tested this:

  • read the first ~1GB off the SSD: ~70MB/s
  • rewrite it
  • read it again: ~110MB/s‼
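
For anyone wanting to repeat the read / rewrite / read test, here is a rough sketch of how it could be scripted. It is not the exact procedure I used; the function names read_throughput and rewrite_in_place are made up for illustration. Two loud caveats: rewriting a block device in place is risky, so only point this at a disk whose data is already copied elsewhere, and the second read can be served from the page cache unless the cache is dropped first, which would inflate the number.

```python
import os
import time

CHUNK = 1024 * 1024  # work in 1 MiB pieces

def read_throughput(path, total_bytes, chunk=CHUNK):
    """Read up to total_bytes from the start of path; return (bytes_read, MB/s)."""
    done = 0
    start = time.monotonic()
    with open(path, "rb", buffering=0) as f:
        while done < total_bytes:
            data = f.read(min(chunk, total_bytes - done))
            if not data:          # end of file or device
                break
            done += len(data)
    elapsed = max(time.monotonic() - start, 1e-9)
    return done, done / elapsed / 1e6

def rewrite_in_place(path, total_bytes, chunk=CHUNK):
    """Read each chunk and write the same bytes back at the same offset."""
    done = 0
    with open(path, "r+b") as f:
        while done < total_bytes:
            f.seek(done)
            data = f.read(min(chunk, total_bytes - done))
            if not data:
                break
            f.seek(done)
            f.write(data)
            done += len(data)
        f.flush()
        os.fsync(f.fileno())      # make sure the data really reaches the device
    return done
```

Usage would be something like read_throughput("/dev/sdX", 10**9), then rewrite_in_place("/dev/sdX", 10**9), then read_throughput again, run as root and with /dev/sdX replaced by the real device.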

So yes, it is something related to this, probably. Diagnosis B: “just” the Samsung Evo bug. So I proceed to (try to) upgrade the firmware:

  • the firmware upgrade and data fix on the “new” SSD: firmware upgraded almost instantly, no issues at all
  • firmware upgrade on the “old” SSD: Failed to upgrade firmware (or some message like that)

Uh oh, so it’s more than this - the SSD seems to be actually broken. I don’t have much experience with SSDs: I used an Intel X25-M for many years, and more recently I have a couple of Samsung 840 Pros, but I never had issues until now, so maybe the cheap Evo is just cheap and fast…

Note that during all this drama, SMART reports the SSD as healthy, and the SMART self-tests (short/long) pass perfectly. Just a number of UDMA CRC errors… so thanks for nothing.

I try the firmware upgrade a few more times, then give up. Diagnosis C: SSD is just plain broken. I’ll send the SSD for a replacement/fix, so (even though it’s encrypted), let’s try to erase it first (as I can still read/write to it).

And since just dd if=/dev/zero of=/dev/sdX is too common, let’s try a builtin (ATA) erase! After fighting with hdparm and the fact that my BIOS does indeed “security freeze” the drives (so you can’t change the security settings nor erase the drives), and finding an article on the net that gives a few workarounds, I manage to “unfreeze” it by live-unplugging not only the SATA cable but also the power cable, and plugging them back in. Time to erase.

Side-note here:

  • hdparm says: “2min for SECURITY ERASE UNIT”; this actually takes ~40 seconds
  • hdparm also says: “8min for ENHANCED SECURITY ERASE UNIT”; this takes only ~5 seconds!

So, especially given the very fast ENHANCED erase, I now believe the SSD is truly broken. Diagnosis C rules.
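
Those timings are easy to sanity-check with a bit of arithmetic. The sketch below is only a back-of-the-envelope calculation: it takes the sector count from the "976773168 sectors" log line above, assumes 512-byte sectors, and computes the throughput the drive would need if an erase meant physically overwriting every sector in the measured time. The ~500MB/s comparison figure is an assumption based on the read speed I see from this model.

```python
SECTORS = 976773168        # from the "ata2.00: 976773168 sectors" log line
SECTOR_BYTES = 512         # assumption: 512-byte logical sectors
CAPACITY = SECTORS * SECTOR_BYTES   # total bytes on the drive

def implied_rate_mb_s(seconds):
    """MB/s needed to touch every byte of the drive in the given time."""
    return CAPACITY / seconds / 1e6

def full_pass_seconds(rate_mb_s):
    """Seconds needed for one full pass over the drive at the given MB/s."""
    return CAPACITY / (rate_mb_s * 1e6)

if __name__ == "__main__":
    print(f"capacity: {CAPACITY / 1e9:.1f} GB")
    print(f"normal erase in 40 s   -> {implied_rate_mb_s(40):.0f} MB/s")
    print(f"enhanced erase in 5 s  -> {implied_rate_mb_s(5):.0f} MB/s")
    print(f"full pass at 500 MB/s  -> {full_pass_seconds(500) / 60:.1f} min")
```

The implied rates come out on the order of 12 GB/s and 100 GB/s, while one full pass at 500MB/s would take well over a quarter of an hour. So whatever the drive does in those 5 or 40 seconds, it cannot be a sector-by-sector overwrite.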

And finally, just confusion

After the SSD erase, reading (the zeros) from the SSD is really fast: more than 500MB/s, as expected.

I also try once more to upgrade the firmware for the Evo bug, and it works flawlessly this time.

Even rebooting the machine many times works without problems; the SSD is seen every single time. I still can’t make it work well over the USB3 bridge (but I’m not very convinced that device works well, since Linux does apply quirks when it detects it). But it doesn’t give any ATA errors this time, and it works (read and write) consistently fast (as per the Evo specs).

I still have to let it go through a long burn-in with verification, but a short one works fine, so at this point I’m just very confused as to what was (is?) the actual problem with this device, and why it behaves so strangely…

Was the laptop broken? Clearly no. Was the SSD broken, or just impacted by the slowdown issue? Time will tell. I’ll put it through the paces, and see. But unless I replace it, I’ll keep this just for “fast cache” use cases, and not to hold real data.

Yep, computers are fun!