
Writecache behaviour is erratic


Hello, I'm running a PetaSAN cluster with 4 hosts and a total of 20 OSDs. On each host an SSD volume has been defined as cache for the (Bluestore) OSDs, which are mostly on HDDs. The OSDs are 3.7 TB in size, and the cache partitions on SSD are 55 GB. The PetaSAN version is the latest, 3.2.1.

No matter how hard I try to tweak the parameters in /opt/petasan/config/tuning/current/writecache, the cache volumes often end up 100% full and are never flushed back down to the low watermark.

My current tuning parameters are:

{
"dm_writecache_throttle": "95",
"high_watermark": "50",
"low_watermark": "45",
"writeback_jobs": "64000"
}

Even when the system is sitting nearly idle, with a reported activity of just 4-5 MB/s of reads and writes, nothing happens. As a result, the write cache is in fact ineffective for one or more rotational OSDs, and write performance/latency is unpredictable - mostly horrible.

If I run "dmsetup message /dev/<VG>/main 0 flush", flushing is initiated and it runs at good speed, thus restoring free space in the cache volume and bringing OSD write performance back on track. So this demonstrates that the disks are capable enough to sustain I/O traffic from flushing at this point in time - then why doesn't the kernel start writing to the back device on its own?

Output from "dmsetup table" shows that the parameters are in fact in place:

ps--6b45d82e--a213--4882--b00d--8697060a78b2--wc--osd.13-main: 0 31138504704 writecache s 254:5 254:4 4096 start_sector 407200 high_watermark 50 low_watermark 45 writeback_jobs 64000 autocommit_blocks 65536 autocommit_time 1500 nofua pause_writeback 0 max_age 0

autocommit_time=1500 is just an attempt I made manually through lvchange --cachesettings, but it didn't bring any improvement, either.
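For the record, this is roughly the command I used (with the VG name as a placeholder, as above):

# rough sketch of the manual change; <VG> is a placeholder for the wc-osd volume group
lvchange --cachesettings 'autocommit_time=1500' <VG>/main
# confirm the new value is visible in the device mapper table
dmsetup table | grep autocommit_time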

I've browsed through the kernel code and the setup scripts, but I cannot figure out what's wrong with this setup. Is there anything else I should be aware of?

Many thanks in advance for your support,

Try the following in /opt/petasan/config/tuning/current/writecache:
{
"dm_writecache_throttle": "50",
"high_watermark": "50",
"low_watermark": "49",
"writeback_jobs": "512"
}

Then run the script:
/opt/petasan/scripts/tuning/writecache_tune.py
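For example, to apply and confirm the change after editing the file (run the script with python3 if it is not directly executable):

/opt/petasan/scripts/tuning/writecache_tune.py
dmsetup table | grep -E 'high_watermark|low_watermark|writeback_jobs'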

Those are the default values, and when I first noticed this issue my cluster was running with exactly those settings.

I have tried moving away from them in the hope of seeing better behaviour, but nothing changed.

I have reset the tuning parameters to the values you suggested, and just as before I see many cache volumes stuck at or near 100% occupancy:

main ps-6b45d82e-a213-4882-b00d-8697060a78b2-wc-osd.10 Cwi-aoC--- <3.45t [cache_cvol] [main_wcorig] 24.14
main ps-6b45d82e-a213-4882-b00d-8697060a78b2-wc-osd.6 Cwi-aoC--- <3.80t [cache_cvol] [main_wcorig] 44.01
main ps-6b45d82e-a213-4882-b00d-8697060a78b2-wc-osd.7 Cwi-aoC--- <3.80t [cache_cvol] [main_wcorig] 99.54
main ps-6b45d82e-a213-4882-b00d-8697060a78b2-wc-osd.8 Cwi-aoC--- <3.80t [cache_cvol] [main_wcorig] 100.00

iostat shows that at the same time, almost zero traffic is flowing to the disks. Shouldn't cache occupation fall back towards 49% then?
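For completeness, a plain iostat from the sysstat package is enough to see this, for example:

# per-device throughput and utilisation, refreshed every 5 seconds
# (the wMB/s and %util columns stay near zero on the HDD backing devices)
iostat -xm 5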

The suggested configuration should flush more often but with less data each time, so it will not stress the slower device.

Thank you, I understand that the suggested configuration will be less stressful to the slower disks.

But the problem remains: I’m using the suggested configuration right now, and I have several cache volumes which have reached 100% occupation and don’t go down to 49%, no matter how long I wait, even if they are idle most of the time. The only way to free them is with a “flush” message - I don’t think writecache was designed to work like this.

Do you have any advice, what could be the cause for this behaviour?

It should automatically flush from the fast to the slow device when it reaches the high watermark and stop at the low watermark.

I just tested it with no issues.

Can you reboot one node to clear all settings and check if you still have the issue?

When did you first experience this problem? Was it working before, or has it always been like this?

If you run

vgs | grep wc-osd | cut -d ' ' -f3 | xargs -I{} dmsetup status {}/main

the last number on the right is the current number of flush I/Os from the fast to the slow device; if it is 0, no flush is in progress.

The number before it is the number of free 4k blocks in the cache device, and the number before that is the total number of 4k blocks. When a flush completes, the free blocks should sit at the low watermark percentage before rising again.
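For example, with a high_watermark of 50, writeback should start as soon as the used blocks exceed 50% of the total, i.e. as soon as the free count drops below half of the total block count. A quick way to turn the status output into a used percentage is something like this; it just reuses the pipeline above and the field positions described (adjust if your output differs):

vgs | grep wc-osd | cut -d ' ' -f3 | xargs -I{} dmsetup status {}/main \
  | awk '{ total=$(NF-2); free=$(NF-1); wb=$NF;
           printf "used %.1f%%  free_blocks %s  writeback_jobs %s\n",
                  (total-free)*100/total, free, wb }'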


Thank you for your answer. As far as I can tell, the cluster has always had this issue.

It is very strange: most OSDs have the "right" numbers, but then one or two OSDs start showing a lower number of free cache blocks and never recover.

This is for example what my node petasan1 shows right now:

root@petasan1:~# vgs | grep wc-osd | cut -d ' ' -f3 | xargs -I{} dmsetup status {}/main
0 8160436224 writecache 0 13131680 6704951 0
0 8160436224 writecache 0 13131680 6781300 0
0 8160436224 writecache 0 13131680 6429592 0
0 7408812032 writecache 0 13131680 5248720 0

The free block count slowly decreases, and it never flushes:

root@petasan1:~# vgs | grep wc-osd | cut -d ' ' -f3 | xargs -I{} dmsetup status {}/main
0 8160436224 writecache 0 13131680 6665640 0
0 8160436224 writecache 0 13131680 6782888 0
0 8160436224 writecache 0 13131680 6376136 0
0 7408812032 writecache 0 13131680 5180331 0

The cluster is very quiet currently, no big activity going on:

root@petasan1:~# ceph -s
  cluster:
    id:     6b45d82e-a213-4882-b00d-8697060a78b2
    health: HEALTH_OK

  services:
    mon: 3 daemons, quorum petasan3,petasan1,petasan2 (age 12h)
    mgr: petasan1(active, since 12h), standbys: petasan2, petasan3
    mds: 1/1 daemons up, 2 standby
    osd: 19 osds: 19 up (since 12h), 19 in (since 12h)
    rgw: 4 daemons active (4 hosts, 1 zones)

  data:
    volumes: 1/1 healthy
    pools:   18 pools, 642 pgs
    objects: 10.55M objects, 14 TiB
    usage:   42 TiB used, 56 TiB / 97 TiB avail
    pgs:     642 active+clean

  io:
    client:   4.9 MiB/s rd, 6.9 MiB/s wr, 43 op/s rd, 181 op/s wr

When I reboot a node, all OSDs start in a clean state initially, with approximately 51% of the cache blocks free, but then within 30-60 minutes some device starts showing the issue.
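To catch the exact moment a device stops flushing after a reboot, something simple like this should be enough; it just loops the same status pipeline and logs it with a timestamp (the log path is arbitrary):

# sample the writecache status every 5 minutes with a timestamp
# (log path is arbitrary; stop with Ctrl-C)
while true; do
    date
    vgs | grep wc-osd | cut -d ' ' -f3 | xargs -I{} dmsetup status {}/main
    sleep 300
done >> /root/writecache_status.log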

This is what dmsetup table shows for this device:

ps--6b45d82e--a213--4882--b00d--8697060a78b2--wc--osd.9-main: 0 7408812032 writecache s 254:11 254:10 4096 start_sector 0 high_watermark 50 low_watermark 48 writeback_jobs 2000 autocommit_blocks 65536 autocommit_time 0 nofua pause_writeback 0 max_age 0

Using the recommended value of 512 for writeback_jobs, or higher values, doesn’t make any observable difference.

Is there anything that could be getting in the way of normal flushing?

Thank you in advance!

Well, things look better: automatic flush is happening, which was not the case before. There seem to be no errors and there are no pending flush jobs.

What is not clear is why the fourth cache has a slightly lower value: the second number, 7408812032, is smaller. From my understanding this number comes from the device mapper layer above the cache device, but I would not worry about it unless the numbers in the status keep rising.

Currently the number of active flush jobs is 0 (last value on the right). If you find it always at 512, it means your incoming traffic is high compared to the flush rate; you can gradually increase the writeback_jobs value in the config file to 2000 or more and re-run the tuning script.

No, unfortunately flush is NOT happening. As I explained in my previous post, the numbers keep decreasing, which shows that cache usage is increasing and there are fewer and fewer free pages in the cache.

NO flush is happening on that cache, and it eventually gets full. At this point, the number of FREE pages has dropped to 4616561:

root@petasan1:~# vgs | grep wc-osd | cut -d ' ' -f3 | xargs -I{} dmsetup status {}/main
0 8160436224 writecache 0 13131680 6449040 0
0 8160436224 writecache 0 13131680 6821384 0
0 8160436224 writecache 0 13131680 5461695 0
0 7408812032 writecache 0 13131680 4616561 0

In my previous message, the fourth cache has a lower number (7408812032) because that particular OSD is slightly smaller (different HDD). But the issue happens randomly on all OSDs. This is another node:

root@petasan4:~# vgs | grep wc-osd | cut -d ' ' -f3 | xargs -I{} dmsetup status {}/main
0 31138504704 writecache 0 13004477 6979801 0
0 31138504704 writecache 0 13004477 3494671 0
0 31138504704 writecache 0 13004477 5604340 0

Here all the HDDs are identical, and if you do the math you will see that cache occupancy is about 73% on the second device (3494671 / 13004477 ≈ 27% of pages still free).

If writecache worked as expected, the number in the second-to-last field would progressively increase.

By the way, I have never found anything but a "0" in the last field, so it seems writeback jobs are almost never queued - but they should be, given that the high watermark has been reached.

If I understand you correctly, the last field has always been zero and the free block count always decreases, which means it never flushes. This is strange, as I just tried it here with no issues. Also, since your cluster already has data stored, it was surely working before? Why do you believe the issue was there from the beginning? Were you running manual flush commands yourself? Can you recall when you first noticed the issue, and whether there were any events at that time? That may help.

Can you also find the running kernel version: uname -r
