Notes on XFS: raid reshapes, metadata size, schedulers…
Posted on February 8, 2019 with tags tech.
How to shoot yourself in the foot…
Just a couple of notes on XFS, and a story regarding RAID reshapes.
sunit, swidth and raid geometry
When I started looking at improving my backup story, the first thing I did was to resize my then-existing backup array from three to four disks, to allow playing with new tools before moving to larger disks. Side note: yes, I now have an array where the newest drive is 7 years newer than the oldest ☺
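The reshape itself was something along the lines of the usual two-step mdadm dance (device names here are made up, not my actual setup):
# mdadm /dev/md0 --add /dev/sdd
# mdadm --grow /dev/md0 --raid-devices=4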
Reshaping worked well as usual (thanks md/mdadm!), and I thought all was good. I forgot, however, that xfs takes its stripe width from the geometry at file-system creation, and any subsequent reshapes are ignored. This is documented (and somewhat logical, if not entirely so) in the xfs(5) man-page:
sunit=value and swidth=value … Typically the only time these mount options are necessary if after an underlying RAID device has had its geometry modified, such as adding a new disk to a RAID5 lun and reshaping it.
But I forgot to do that. On top of a raid5. Which means that instead of writing full stripes, xfs was writing ⅔ of a stripe (the old two-data-disk width instead of the new three), resulting in lots of read-modify-write, which explains why I saw some unexpected performance issues even in single-threaded workloads.
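For the record, the fix is a one-time mount with the correct geometry, which (if I understand correctly) xfs then records in the superblock. The sunit/swidth mount options are given in 512-byte sectors, so for a 4-disk RAID5 with a 512KiB chunk that would be sunit=1024 and swidth=3×1024=3072. A minimal sketch, with made-up device and mount point names:
# mdadm --detail /dev/md0 | grep 'Chunk Size'
# umount /backup
# mount -o sunit=1024,swidth=3072 /dev/md0 /backup
# xfs_info /backup
Note that xfs_info reports sunit/swidth in file-system blocks rather than 512-byte sectors, so the numbers it prints won't match the mount options one-to-one.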
Lesson learned…
xfs metadata size
Once I got my new large HDDs, I created the new large (for me, at least) file-system with all the bells and whistles, which includes lots of metadata. And by lots, I really mean lots:
# mkfs.xfs -m rmapbt=1,reflink=1 /dev/mapper/…
# df -h
Filesystem Size Used Avail Use% Mounted on
/dev/mapper/ 33T 643G 33T 2% /…
Almost 650G of metadata! 650G!!!!! Let's try some variations, on a 50T file-system (on top of a sparse file, not that I have 50T lying around!):
Plain mkfs.xfs:
Filesystem Size Used Avail Use% Mounted on
/dev/loop0 50T 52G 50T 1% /mnt
mkfs.xfs with rmapbt=1:
Filesystem Size Used Avail Use% Mounted on
/dev/loop0 50T 675G 50T 2% /mnt
mkfs.xfs with rmapbt=1,reflink=1:
Filesystem Size Used Avail Use% Mounted on
/dev/loop0 50T 981G 50T 2% /mnt
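(For completeness, reproducing this kind of test on a sparse file is quick; the image path is made up:)
# truncate -s 50T /tmp/xfs-test.img
# losetup --find --show /tmp/xfs-test.img
# mkfs.xfs -m rmapbt=1,reflink=1 /dev/loop0
# mount /dev/loop0 /mnt && df -h /mnt
losetup --find --show prints the loop device it attached (here /dev/loop0), which is what the subsequent mkfs and mount use.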
So indeed, the extra “bells” do eat a lot of disk. Fortunately that ~1T of metadata is not written at mkfs time, otherwise it’d be fun!
The actual allocation for that sparse file was “just” 2G. I wonder how come the metadata is consistent if mkfs doesn’t ensure zeroing?
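(A quick way to compare the apparent size with the actual allocation, same made-up path as above:)
# du -h --apparent-size /tmp/xfs-test.img
# du -h /tmp/xfs-test.img
The first reports the nominal 50T, the second only the blocks actually allocated.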
Not sure I’ll actually test the reflink support soon, but since I won’t be able to recreate the file-system easily once I put lots of data on it, I’ll leave it as such. It’s, after all, only 2% of the disk space, even if the absolute value is large.
XFS schedulers
Apparently CFQ is not a good scheduler for XFS. I’m probably the last person on earth to learn this, as it has been documented for ages, but better late than never. Yet another explanation for the bad performance I was seeing…
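Checking and switching the scheduler is just a sysfs poke; sda is a placeholder, and the list of available schedulers depends on the kernel (and on whether blk-mq is in use):
# cat /sys/block/sda/queue/scheduler
# echo deadline > /sys/block/sda/queue/scheduler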
Summary: yes, I have fun playing a sysadmin at home ☺