
Writecache behaviour is erratic


Thank you for your reply 🙂

This is the output of uname -r: 5.14.21-04-petasan

Until recently, the cache devices were quite small (about a 10GB partition for each OSD), so I thought it was more or less normal for them to be constantly near 100% occupancy. The cluster worked, write operations almost always went directly to the HDDs, and sync write performance was pretty bad, but we didn't care much because we used the cluster mostly to store backup data.

Then the hardware was upgraded and new cache SSDs were installed, so that now every OSD has a 50GB cache partition, and it could be increased further. But even after the upgrade, write performance was still poor, so I started investigating the issue and found that the cache volumes behave in this bizarre way.

I run a flush command manually from time to time, and of course it makes 100% of the cache blocks free again. But then usage increases again, and when the high_watermark is reached, flushing sometimes occurs, but most of the time it does not.
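
The fill level described above can be watched from userspace. The following is only a sketch: it assumes the status line layout given in the kernel's writecache documentation (error indicator, total blocks, free blocks, blocks under writeback after the target name), and the function name wc_used_percent is made up for illustration.

```shell
#!/bin/bash
# Sketch: compute cache occupancy (%) from one `dmsetup status` line of a
# writecache target. Assumed field layout:
#   <start> <length> writecache <error> <total_blocks> <free_blocks> <writeback_blocks> ...
wc_used_percent() {
    # $1 = one status line; prints the used percentage as an integer.
    # Lines that do not belong to a writecache target print nothing.
    echo "$1" | awk '$3 == "writecache" && $5 > 0 { printf "%d\n", ($5 - $6) * 100 / $5 }'
}

# Hypothetical usage on a node, addressing the device as <vg>/main
# the same way the commands in this thread do:
#   wc_used_percent "$(dmsetup status <vg>/main)"
```

Running this in a loop while writing to the cluster should show whether usage drops back once the high_watermark is crossed.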

Is there anything I could do in order to trace what happens within the writecache module?

Thank you,

Hello, I finally found out exactly what is going on with writecache.

Short summary: even when "dmsetup" shows "pause_writeback 0", this parameter is actually non-zero if nobody sets it to an explicit value. For this reason, writecache suspends writeback for a while whenever it notices I/O activity to the slow device. In my case, this was enough to make the fast device go over the high_watermark, and often reach 100%.

Line 33 of drivers/md/dm-writecache.c in the kernel source has:

#define PAUSE_WRITEBACK (HZ * 3)

This non-zero value is applied by default, even if the "table" command later shows it to be zero (I would say this is a bug in the kernel module: two different variables are used, one for display and one for the real logic).

I could restore the expected normal behaviour by running this command on each node:

vgs | grep wc-osd | cut -d ' ' -f3 | xargs -I{} dmsetup message {}/main 0 pause_writeback 0
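
The one-liner above depends on how vgs pads its columns with spaces. A slightly more defensive variant might look like the sketch below. It assumes, as in this thread, that the cache VG names contain "wc-osd" and that the cached LV is called "main"; the function name reset_pause_writeback is made up for illustration.

```shell
#!/bin/bash
# Sketch: same effect as the one-liner, but asks vgs for bare VG names
# instead of cutting the default table output on spaces.
reset_pause_writeback() {
    vgs --noheadings -o vg_name | tr -d ' ' | grep wc-osd | while read -r vg; do
        # dmsetup message <device> <sector> <message...>
        dmsetup message "${vg}/main" 0 pause_writeback 0 \
            || echo "failed to set pause_writeback on ${vg}/main" >&2
    done
}

# Usage (as root, on each node):
#   reset_pause_writeback
```

The error message on stderr makes it easier to spot a VG where the message was rejected.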

Maybe other users never run into this issue because they have faster devices or processors; I have no idea. In any case, we should be aware that writecache gives misleading information in this regard.

Even better, PetaSAN could add pause_writeback to the set of cache parameters which get tuned by /opt/petasan/config/tuning/current/writecache, so it will always have a predictable value.

Thank you for your support.

Thank you very much for this detailed feedback 🙂
As suggested, it is better/safer for us to always set pause_writeback 0 in our tune script.

Looking at the original kernel commits that added this feature, the idea seems to be this: if your cache partition gets full, all client writes have to go directly to the slow device, so background flushes are paused in order not to impact those client writes. Flushing only resumes once pause_writeback (3 sec) has elapsed since the last client write, so it could pause forever if client writes persist.

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/drivers/md/dm-writecache.c?h=v6.8&id=95b88f4d71cb953e02206be3c757083601391a0f

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/drivers/md/dm-writecache.c?h=v6.8&id=5c0de3d72f8c05678ed769bea24e98128f7ab570
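
To make the pause behaviour described above concrete, here is a very simplified model of it. This is not the kernel code, only an illustration of the rule as described in this thread; the function and argument names are made up.

```shell
#!/bin/bash
# Simplified model (not kernel code): background writeback is held back
# until the slow device has seen no client I/O for at least the pause time.
#   $1 = idle_secs  : seconds since the last client I/O to the slow device
#   $2 = pause_secs : pause_writeback expressed in seconds (0 disables the pause)
writeback_allowed() {
    local idle_secs=$1 pause_secs=$2
    if [ "$pause_secs" -eq 0 ] || [ "$idle_secs" -ge "$pause_secs" ]; then
        echo yes
    else
        echo no
    fi
}

# Example: with a 3 second pause and a client write 1 second ago,
# writeback stays paused:
#   writeback_allowed 1 3
```

With steady client I/O the idle time never reaches the pause value, which matches the "could pause forever" case.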

Maybe the issue you saw was due to hitting a cache-full situation, or maybe there is another bug associated with the above feature. I would not rule out the latter: looking at the added code, its integration and logic may have flaws.
In the longer term we will run more tests to replicate this condition and, if we can, will report it to kernel upstream or suggest a fix.

Thanks again!

Hi all,

We had exactly the same situation on our PetaSAN cluster. Running the following command on each node cleared the cache and solved our issue:

vgs | grep wc-osd | cut -d ' ' -f3 | xargs -I{} dmsetup message {}/main 0 pause_writeback 0

We saw that after rebooting a node, the cache fills up again. Is there a way to make the setting persistent?

Thanks!

kr,

Robin

 

Hello, in order to make the change persistent you can for example create a service definition of your own under /lib/systemd/system, using something like:

[Unit]
Description=Custom Petasan tuning
After=ceph.target

[Service]
Type=simple
ExecStart=/root/reset-writecache-pause
Restart=on-failure
RestartSec=60
RemainAfterExit=yes

[Install]
WantedBy=ceph.target

You can save that, for example, as custom-tuning.service, then run systemctl daemon-reload to update the config and systemctl enable custom-tuning.service so that it is started at boot.

And the content of /root/reset-writecache-pause: (the file should be executable)

#!/bin/bash

vgs | grep wc-osd | cut -d ' ' -f3 | xargs -I{} dmsetup message {}/main 0 pause_writeback 0
