Notes on XFS: raid reshapes, metadata size, schedulers…

Posted on February 8, 2019 with tags tech.

Just a couple on notes on XFS, and a story regarding RAID reshapes.

sunit, swidth and raid geometry

When I started looking at improving my backup story, first thing I did was resizing my then-existing backup array from three to four disks, to allow playing with new tools before moving to larger disks. Side-note: yes, I now have an array where the newest drive is 7 years newer than the oldest ☺

Reshaping worked well as usual (thanks md/mdadm!), and I thought all is good. I forgot however that xfs takes its stripe width from the geometry at file-system creation, and any subsequent reshapes are ignored. This is documented (and somewhat logical, if not entirely), in the xfs(5) man-page:

sunit=value and swidth=value … Typically the only time these mount options are necessary if after an underly‐ ing RAID device has had it’s geometry modified, such as adding a new disk to a RAID5 lun and reshaping it.

But I forgot to do that. On top of a raid5. Which means, instead of writing full stripes, xfs was writing ⅔ of a stripe, resulting in lots of read-modify-write, which explains why I saw some unexpected performance issues even in single-threaded workloads.

Lesson learned…

xfs metadata size

Once I got my new large HDDs, I created the new large (for me, at least) file-system with all bells and whistles, which includes lots of metadata. And by lots, I really mean lots:

# mkfs.xfs -m rmapbt=1,reflink=1 /dev/mapper/…
# df -h 
Filesystem         Size  Used Avail Use% Mounted on
/dev/mapper/        33T  643G   33T   2% /…

Almost 650G of metadata! 650G!!!!! Let’s try some variations, on a 50T file-system (on top of sparse file, not that I have this laying around!):

Plain mkfs.xfs:
Filesystem      Size  Used Avail Use% Mounted on
/dev/loop0       50T   52G   50T   1% /mnt
mkfs.xfs with rmapbt=1:
Filesystem      Size  Used Avail Use% Mounted on
/dev/loop0       50T  675G   50T   2% /mnt
mkfs.xfs with rmapbt=1,reflink=1:
Filesystem      Size  Used Avail Use% Mounted on
/dev/loop0       50T  981G   50T   2% /mnt

So indeed, the extra “bells” do eat a lot of disk. Fortunately that ~1T of metadata is not written at mkfs time, otherwise it’d be fun! The actual allocation for that sparse file was “just” 2G. I wonder how come the metadata is consistent if mkfs doesn’t ensure zeroing?

Not sure I’ll actually test the reflink support soon, but since I won’t be able to recreate the file-system easily once I put lots of data on it, I’ll leave it as such. It’s, after all, only 2% of the disk space, even if the absolute value is large.

XFS schedulers

Apparently CFQ is not a good scheduler for XFS. I’m probably the last person on earth to learn this, as it was documented for ages, but better late than never. Yet another explanation for the bad performance I was seeing…

Summary: yes, I have fun playing a sysadmin at home ☺