Server upgrades and monitoring

Posted on October 17, 2015.

Undecided whether the title should be “exercises in Yak shaving” or “paying back technical debt” or “too much complexity for personal systems”. Anyway…

I started hosting my personal website and some other small stuff on a dedicated box (rented from a provider) in early 2008. Despite being a relatively cheap box, it worked without issues for a good number of years. A surprising number of years, actually; the only issue was a power supply failure that the provider fixed automatically, and then nothing for many years. Even the (mechanical) hard drive had no issues at all for 7 years (Power_On_Hours: 64380; I probably got it after it had a few months of uptime). I believe it was the longest-running hard drive I’ve ever used (for the record: Seagate Barracuda 7200.10, ST3250310AS).

I delayed the upgrade for a long time for two reasons. First, at the same provider I couldn’t get a similar SLA for the same amount of money. I could get better hardware, but with a worse SLA and fewer options. This is easily solvable, of course, by just finding a different provider.

The other issue was that I never bothered to set up proper configuration management for the host; after all, it was only supposed to run Apache with ikiwiki and a few other trivial things. The truth was that over time it started piling up more and more “small things”… so actually changing the host is expensive.

As the age of the server neared 7 years, I thought I would combine the upgrade from Wheezy to Jessie with a HW upgrade. I managed to find a different provider that had my desired SLA and HW configuration and got the server, so the only thing left was to do the migration.

Previous OS upgrades were simple as they were on the same host; i.e. rely on Debian’s reliable upgrade process and nothing else, except possibly adjusting some configs slightly. With a cross-host upgrade (I couldn’t just copy the old OS since it was also a 32-to-64 bit change) it’s much worse: since there’s no previous installation, I had to manually check and port the old configuration for each individual service. This got very tedious, and I realised I had to improve it somehow.

“Proper” configuration management aside, I thought that I needed proper monitoring first. I already had (for a long while, actually) graphing via Munin, but no actual monitoring. Since the host only had a few services, this was again supposed to be easy: the same mistake again.

The problem is that once you have any monitoring system set up, it’s very easy to add “just one more” host or service to it. First it was only the external box, then it was my firewall, then it was the rest of my home network. Then it was the cloud services that I use: for example, checking whether my domain registrar’s nameservers are still authoritative for my domain, or whether the expiration date is still far in the future. And so on…
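To make the expiration-date check concrete, here is a minimal sketch in Python. It is not the check I actually run: the function name `check_expiry`, the 60/14-day thresholds, and the Nagios-style status codes are all assumptions picked for illustration, and how you obtain the expiration date (WHOIS, registrar API, a hand-maintained file) is left out entirely.

```python
from datetime import date

# Nagios-style status codes (an assumed convention, not tied to any
# particular monitoring system mentioned in this post).
OK, WARNING, CRITICAL = 0, 1, 2


def check_expiry(expires, today, warn_days=60, crit_days=14):
    """Classify how close a domain is to its expiration date.

    `expires` and `today` are datetime.date objects; the thresholds
    are hypothetical defaults and should be tuned to taste.
    """
    remaining = (expires - today).days
    if remaining < crit_days:
        status = CRITICAL
    elif remaining < warn_days:
        status = WARNING
    else:
        status = OK
    return status, f"domain expires in {remaining} days"


if __name__ == "__main__":
    # Example with a made-up expiration date.
    print(check_expiry(date(2016, 10, 17), date(2015, 10, 17)))
```

Passing `today` explicitly rather than calling `date.today()` inside the function keeps the check easy to test against fixed dates.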

In the end, what had been a half-weekend job in previous iterations (e.g. the Squeeze to Wheezy upgrade) got spread out over many weekends (interleaved with other activities, so not working on it full time). I had to keep the old machine running for one more month to make sure everything was up and running, and I ended up with 80 services monitored across multiple systems; the migrated machine itself has almost half of these. Some of these are light items (e.g. a check that a single vhost responds), others are aggregates. I still need to add some more checks though, especially more complex (end-to-end) ones.
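For the curious, an “aggregate” can be as simple as rolling several individual check results up into the worst one. The sketch below is only an illustration of that idea, not my actual setup: the `aggregate` function, the status codes, and the severity ordering (CRITICAL above UNKNOWN above WARNING above OK) are all assumptions.

```python
# Nagios-style status codes (assumed convention).
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3

# Hypothetical severity ranking: higher means worse.
SEVERITY = {OK: 0, WARNING: 1, UNKNOWN: 2, CRITICAL: 3}


def aggregate(statuses):
    """Return the worst status among the given individual results.

    An empty input yields UNKNOWN, since nothing was actually checked.
    """
    statuses = list(statuses)
    if not statuses:
        return UNKNOWN
    return max(statuses, key=SEVERITY.__getitem__)


if __name__ == "__main__":
    # Three vhost checks, one of them in WARNING state.
    print(aggregate([OK, OK, WARNING]))
```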

The lesson I learned in all this is that, with or without configuration management in place, having monitoring makes it much easier to do host or service moves, as you know much better whether everything is “done-done” or just “almost done”.

The question that remains, though: with 80 services for a home network plus external systems (all for personal use), I’m not sure whether I’m doing things right (monitoring the stuff I need) or wrong (do I really need this many things?).