Virtual hardware failures

Posted on September 15, 2010 with tags tech.

Note: The title is intentionally ambiguous; it can be read both as “virtual failures” and as “failures of virtual hardware”, and I’m not sure which reading fits the cases below better.

One of the funny things about virtualization is how virtual “hardware” fails. My virtual machine images are stored on NFS, and they often hit an NFS glitch of some kind, at which point the kernel logs show:

kernel: mptscsih: ioc0: attempting task abort! (sc=ffff880018320100)
kernel: sd 0:0:0:0: [sda] CDB: Read(10): 28 00 00 00 00 38 00 00 08 00
kernel: mptscsih: ioc0: task abort: FAILED (sc=ffff880018320100)
kernel: mptscsih: ioc0: attempting target reset! (sc=ffff880018320100)
kernel: sd 0:0:0:0: [sda] CDB: Read(10): 28 00 00 00 00 38 00 00 08 00
kernel: mptscsih: ioc0: target reset: SUCCESS (sc=ffff880018320100)
kernel: mptscsih: ioc0: attempting task abort! (sc=ffff880018320100)
kernel: sd 0:0:0:0: [sda] CDB: Read(10): 28 00 00 00 00 38 00 00 08 00
kernel: mptscsih: ioc0: task abort: FAILED (sc=ffff880018320100)
kernel: mptscsih: ioc0: attempting task abort! (sc=ffff88001eefcc00)
kernel: sd 0:0:0:0: [sda] CDB: Read(10): 28 00 00 98 01 a7 00 00 08 00
kernel: mptscsih: ioc0: task abort: FAILED (sc=ffff88001eefcc00)
kernel: mptscsih: ioc0: attempting target reset! (sc=ffff880018320100)
kernel: sd 0:0:0:0: [sda] CDB: Read(10): 28 00 00 00 00 38 00 00 08 00
kernel: mptscsih: ioc0: target reset: SUCCESS (sc=ffff880018320100)

While the transient NFS issues are probably a real bug, I have this image of a state machine inside the driver trying the same error paths over and over, in a futile attempt at fixing the “hardware”, without ever locking onto the real hardware behaviour.
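To see what the driver keeps retrying, the CDB lines in the log can be decoded by hand. The following is a minimal sketch, assuming the standard SCSI READ(10) layout (opcode 0x28, a 4-byte big-endian logical block address, a 2-byte big-endian transfer length); the function name `decode_read10` is made up for illustration and is not part of any driver.

```python
import struct


def decode_read10(cdb_hex: str) -> dict:
    """Decode a SCSI READ(10) CDB given as space-separated hex bytes."""
    cdb = bytes.fromhex(cdb_hex)
    if len(cdb) != 10 or cdb[0] != 0x28:
        raise ValueError("not a READ(10) CDB")
    # Layout: opcode, flags, 4-byte big-endian LBA, group number,
    # 2-byte big-endian transfer length (in blocks), control byte.
    _opcode, _flags, lba, _group, blocks, _control = struct.unpack(">BBIBHB", cdb)
    return {"lba": lba, "blocks": blocks}


if __name__ == "__main__":
    # The two distinct CDBs that appear in the log above.
    for cdb in ("28 00 00 00 00 38 00 00 08 00",
                "28 00 00 98 01 a7 00 00 08 00"):
        print(cdb, "->", decode_read10(cdb))
```

Both commands are plain 8-block reads, one at LBA 56 and one at LBA 9961895 (4 KiB each, if the disk uses 512-byte logical blocks, which is an assumption). Nothing exotic is being asked of the disk; the same ordinary reads are simply aborted and retried.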

Or, this most recent case during boot and resource probing:

kernel:  e1000: 0000:00:08.0: e1000_probe: The EEPROM Checksum Is Not Valid
kernel:  /*********************/
kernel:  Current EEPROM Checksum : 0x031a
kernel:  Calculated              : 0x031a
kernel:  Offset    Values
kernel:  ========  ======
kernel:  00000000: 08 00 27 f6 f3 9b 00 00 ff ff 00 00 00 00 00 00
kernel:  00000010: 00 00 00 00 08 44 1e 00 86 80 0e 10 86 80 40 30
kernel:  00000020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
kernel:  00000030: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
kernel:  00000040: 00 00 61 70 0c 28 c8 00 c8 00 00 00 00 00 00 00
kernel:  00000050: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 02 06
kernel:  00000060: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
kernel:  00000070: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 1a 03
kernel:  Include this output when contacting your support provider.
kernel:  This is not a software error! Something bad happened to your hardware or
kernel:  EEPROM image. Ignoring this problem could result in further problems,
kernel:  possibly loss of data, corruption or system hangs!
kernel:  The MAC Address will be reset to 00:00:00:00:00:00, which is invalid
kernel:  and requires you to set the proper MAC address manually before continuing
kernel:  to enable this network device.
kernel:  Please inspect the EEPROM dump and report the issue to your hardware vendor
kernel:  or Intel Customer Support.
kernel:  /*********************/
kernel:  e1000: 0000:00:08.0: e1000_probe: Invalid MAC Address
kernel:  e1000: 0000:00:08.0: e1000_probe: (PCI:33MHz:32-bit) 00:00:00:00:00:00
kernel:  e1000: eth0: e1000_probe: Intel(R) PRO/1000 Network Connection

This is an emulated Intel NIC under VirtualBox. I had never seen this error before, and it went away after a reboot. The funniest thing is the “This is not a software error!” message… well, it definitely is :), but the driver has no way to know that.
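The dump itself can be cross-checked. The sketch below assumes the checksum rule of the Linux e1000 driver as I understand it: the EEPROM is read as little-endian 16-bit words, and words 0x00 through 0x3F must sum to 0xBABA, with word 0x3F holding the checksum. The helper names (`parse_dump`, `eeprom_checksums`, `mac_address`) are hypothetical, and the hex rows are copied from the log above.

```python
EEPROM_SUM = 0xBABA    # target sum (assumed from the Linux e1000 driver)
CHECKSUM_WORD = 0x3F   # index of the word that stores the checksum

# Hex rows copied from the kernel log above, "kernel:" prefix removed.
DUMP = """
00000000: 08 00 27 f6 f3 9b 00 00 ff ff 00 00 00 00 00 00
00000010: 00 00 00 00 08 44 1e 00 86 80 0e 10 86 80 40 30
00000020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00000030: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00000040: 00 00 61 70 0c 28 c8 00 c8 00 00 00 00 00 00 00
00000050: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 02 06
00000060: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00000070: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 1a 03
"""


def parse_dump(text: str) -> bytes:
    """Turn 'offset: hex bytes' rows into one bytes object."""
    data = bytearray()
    for line in text.strip().splitlines():
        _offset, _sep, values = line.partition(":")
        data += bytes.fromhex(values)
    return bytes(data)


def eeprom_checksums(data: bytes) -> tuple:
    """Return (stored, calculated) checksums as 16-bit integers."""
    words = [data[i] | (data[i + 1] << 8) for i in range(0, len(data), 2)]
    stored = words[CHECKSUM_WORD]
    # The checksum word is whatever makes the 16-bit sum equal EEPROM_SUM.
    calculated = (EEPROM_SUM - sum(words[:CHECKSUM_WORD])) & 0xFFFF
    return stored, calculated


def mac_address(data: bytes) -> str:
    """The first six EEPROM bytes hold the MAC address."""
    return ":".join(f"{b:02x}" for b in data[:6])


if __name__ == "__main__":
    eeprom = parse_dump(DUMP)
    stored, calculated = eeprom_checksums(eeprom)
    print(f"stored=0x{stored:04x} calculated=0x{calculated:04x}")
    print("MAC:", mac_address(eeprom))
```

On this dump both values come out as 0x031a, the same numbers the driver printed as “Current” and “Calculated”, and the first six bytes read as 08:00:27:f6:f3:9b, which carries the 08:00:27 prefix that VirtualBox uses for generated MAC addresses. So the contents the driver dumped look perfectly consistent; my guess, and it is only a guess, is that the earlier read used for validation returned something different.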

I think that these software-driven failures probably trigger some obscure error-handling paths; I assume that the usual failures of real and virtual hardware are quite different in nature, and that the right recovery differs as well. This leads to the driver using the wrong recovery strategy, but fortunately nothing is really broken, just in the wrong state… Since I’m not familiar with low-level driver details, I might be wrong of course; I’m just speculating on what I think goes on inside the driver.

I’m not even mentioning virtual clock issues, as those are no longer funny…