Backfill speed settings have no effect on PetaSAN 3.2.1
dbutti
28 Posts
March 5, 2024, 11:59 am
After a bit of struggling, I have found out that all the settings in the PetaSAN GUI related to backfill speed are IGNORED in the current version.
PetaSAN 3.2.1 is based on Ceph 17.2.5 (Quincy), which introduced a whole new set of knobs (the mClock scheduler) that must be used to control backfill and recovery processes: https://docs.ceph.com/en/quincy/rados/configuration/mclock-config-ref/
As a result, while still trying to throttle backfill with the GUI options, I suffered multiple OSD failures on a 4-host, 20-drive cluster when a ruleset update sent 20TB of data moving across spinning disks. Without mClock tuning, PetaSAN runs with the default values, which impose essentially no limitation on recovery traffic and give no priority to normal client I/O operations, and this can easily overwhelm larger and slower rotating disks.
If throttling of the recovery/backfill processes is needed, one can currently use something like this on the CLI:
ceph config set osd osd_mclock_profile custom
ceph config set osd osd_mclock_scheduler_client_wgt 4
ceph config set osd osd_mclock_scheduler_background_recovery_lim 100
ceph config set osd osd_mclock_scheduler_background_recovery_res 20
In the Quincy version used by PetaSAN, the "recovery_lim" and "recovery_res" values are expressed in IOPS (they can also be set to different values for individual OSDs if needed). But beware: starting with Ceph 17.2.7 / 18.2.0 they become a float between 0.0 and 1.0, i.e. a fraction of the total IOPS capacity measured for each OSD during performance tuning.
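For example (purely as an illustration, with osd.12 standing in for whatever OSD ID you need to tune), a single OSD can be given its own limit, and the IOPS capacity Ceph measured for a drive can be inspected like this:
ceph config set osd.12 osd_mclock_scheduler_background_recovery_lim 50   # per-OSD override, example ID
ceph config show osd.12 osd_mclock_max_capacity_iops_hdd   # measured capacity; use _iops_ssd for flash-backed OSDs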
This page from Proxmox support has the full story: https://pve.proxmox.com/wiki/Ceph_mClock_Tuning
I believe the GUI controls should be updated to use this mechanism; until that happens, perhaps as a quick fix a warning message could be added to inform the administrator that these parameters have no effect.
Thank you,
admin
2,918 Posts
March 5, 2024, 2:49 pm
Yes, there was a bug related to the mClock scheduler setting.
For PetaSAN clusters built with 3.2.x, the bug will not show.
For PetaSAN clusters that were initially built before 3.2.x and then upgraded, it shows up as the backfill speed settings having no effect.
The fix is to set
ceph config set osd osd_op_queue wpq
then restart the OSDs on each node:
systemctl restart ceph-osd.target
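To double-check that the OSDs actually picked up the change after the restart, you can for example run (osd.0 here is just an example ID):
ceph config get osd osd_op_queue   # value stored in the cluster config database
ceph tell osd.0 config get osd_op_queue   # value in effect on a running OSD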
The update fix will be included in the 3.3 upgrade, which should be out within days.
The earlier update script incorrectly ran
ceph config set osd.* osd_op_queue wpq
instead of
ceph config set osd osd_op_queue wpq
The first syntax should have worked, but due to a different bug it does not.
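If you want to check whether the incorrectly scoped entry is still present and remove it, something like this should work (assuming it shows up under the osd.* mask in the config dump; quote the mask so the shell does not expand it):
ceph config dump | grep osd_op_queue   # look for a stale osd.* entry
ceph config rm 'osd.*' osd_op_queue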
Once the mClock scheduler is more stable, we will switch to it from wpq.
Last edited on March 5, 2024, 2:50 pm by admin · #2
dbutti
28 Posts
March 5, 2024, 3:03 pm
Hello, thank you for your reply.
I see that the mClock scheduler has far more options and is based on individual measurements of the IOPS capacity of each OSD drive, which probably makes it a better choice for clusters with mixed hardware types.
I will watch for the configuration change when 3.3 comes out, and possibly set it back to mClock on the CLI if finer control is needed.
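If I understand the mechanism correctly, switching back would be something along these lines (since osd_op_queue is only read at OSD startup, a restart is needed afterwards):
ceph config set osd osd_op_queue mclock_scheduler
systemctl restart ceph-osd.target   # repeat the restart on every node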
It would be nice if you could add a release note about this change, so users are warned and don't get caught by surprise. This is not the first time a configuration mismatch has appeared because a cluster was upgraded from a previous version (as opposed to being installed afresh), but that is how things always go with real production clusters, so it is crucial to know exactly how the software will behave after an upgrade.