PetaSAN cluster issue - Module 'devicehealth' has failed: unknown operation

nocstaff@urbancom.net
11 Posts
June 2, 2025, 6:30 pm
I am running a Proxmox cluster with 4 PetaSAN nodes clustered for storage. Over the weekend, 23 of my 32 drives went offline. This is what I get when I check the status. How can I fix this?
root@uccsan01:~# ceph status
  cluster:
    id:     9774be8a-c28d-4b1a-b614-e99a0151d001
    health: HEALTH_ERR
            Module 'devicehealth' has failed: unknown operation
            Reduced data availability: 100 pgs inactive, 59 pgs down, 11 pgs stale
            Degraded data redundancy: 4961154/14820801 objects degraded (33.474%), 63 pgs degraded, 63 pgs undersized
            2 slow ops, oldest one blocked for 38 sec, osd.5 has slow ops

  services:
    mon: 3 daemons, quorum uccsan03,uccsan01,uccsan02 (age 9h)
    mgr: uccsan02(active, since 4M), standbys: uccsan03, uccsan01
    osd: 32 osds: 9 up (since 2h), 9 in (since 2h); 64 remapped pgs

  data:
    pools:   2 pools, 129 pgs
    objects: 4.94M objects, 19 TiB
    usage:   34 TiB used, 98 TiB / 132 TiB avail
    pgs:     77.519% pgs not active
             4961154/14820801 objects degraded (33.474%)
             461988/14820801 objects misplaced (3.117%)
             46 down
             39 undersized+degraded+remapped+backfill_wait+peered
             21 active+undersized+degraded+remapped+backfill_wait
             11 stale+down
             7  active+clean
             2  undersized+degraded+remapped+backfilling+peered
             2  down+remapped
             1  active+undersized+degraded

  io:
    recovery: 48 MiB/s, 11 objects/s

  progress:
    Global Recovery Event (2d)
      [=...........................] (remaining: 5w)

nocstaff@urbancom.net
11 Posts
June 2, 2025, 6:33 pm
I am running version 3.3.

admin
2,980 Posts
June 3, 2025, 6:37 pm
The issue is not related to "Module 'devicehealth' has failed: unknown operation". You can ignore this error.
The problem is that the OSDs are down, and the primary focus is to get them back up. Many things could cause this, including hardware or network faults. I can see from the status that even the monitor quorum was re-formed only hours ago (age 9h), so it may not be just the OSDs. First try rebooting the cluster; if that does not help, you will need to look at the logs in /var/log/ceph and /var/log/syslog, and at the dmesg output.
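If a reboot does not bring the OSDs back, the log check suggested above could be narrowed down roughly as follows. This is only a sketch: the helper names list_down_osds and inspect_osd are made up for illustration, and it assumes the OSDs run under the stock ceph-osd@<id> systemd units and write to the default /var/log/ceph/ceph-osd.<id>.log files, which may not match every installation.

```shell
#!/bin/sh
# Sketch only: the helper names below are hypothetical, not part of Ceph or PetaSAN.

# Print the names (osd.N) of OSDs reported as "down" in `ceph osd tree`
# output read from stdin. The status word follows the OSD name, so we check
# the field after each "osd.N" token instead of relying on a fixed column.
list_down_osds() {
    awk '{
        for (i = 1; i < NF; i++)
            if ($i ~ /^osd\.[0-9]+$/ && $(i + 1) == "down")
                print $i
    }'
}

# Show the systemd state and the most recent log lines for one OSD, e.g. "osd.5".
# Assumption: stock ceph-osd@<id> unit name and default log file location.
inspect_osd() {
    id=${1#osd.}
    systemctl status "ceph-osd@${id}" --no-pager
    journalctl -u "ceph-osd@${id}" --no-pager -n 50
    tail -n 50 "/var/log/ceph/ceph-osd.${id}.log"
}
```

With these functions loaded on a storage node, `ceph osd tree | list_down_osds` would list the down OSDs, `inspect_osd osd.5` would show why that daemon stopped (run it on the node that hosts the OSD), and `dmesg -T | tail -n 50` would surface recent kernel messages such as disk or network errors.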