3 Node Cluster - 1 of the nodes was down because of a CPU issue

nocstaff@urbancom.net
9 Posts
August 11, 2023, 2:33 pm
I have a 3-node cluster and lost one of my nodes because of a CPU failure. It took almost two weeks to get the parts and restore the node. I brought the node back online, and after almost a day I am still getting warnings. The restored node has 8 OSDs, but only 3 are showing as up.
Here is the current health:
ceph health
HEALTH_WARN 2 nearfull osd(s); Low space hindering backfill (add storage if this doesn't resolve itself): 15 pgs backfill_toofull; Degraded data redundancy: 1127223/9042237 objects degraded (12.466%), 193 pgs degraded, 193 pgs undersized; 513 pgs not deep-scrubbed in time; 513 pgs not scrubbed in time; 2 pool(s) nearfull
Should I leave it alone and let it resolve on its own, or is there something I should be doing? I am concerned because of the low space warning.
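For reference, the standard Ceph commands to identify the down OSDs and check per-OSD fullness (output will vary per cluster):
ceph health detail   # expands the warning, listing the nearfull OSDs and backfill_toofull PGs
ceph osd tree        # shows each OSD's up/down status and crush weight, grouped by host
ceph osd df          # per-OSD utilization, to see how close each OSD is to the nearfull ratio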

admin
2,974 Posts
August 11, 2023, 7:49 pm
I would lower the OSD crush weight on all the OSDs in the problem node, then try to start the 5 down OSDs. If they fail to start, look at their logs to try to find the problem. If they have been damaged by the initial failure, you will need to replace them.
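A rough sketch of those steps, assuming systemd-managed OSDs; osd.5 is a placeholder ID and 0.5 a placeholder weight, so substitute the real values for each down OSD from ceph osd tree:
ceph osd crush reweight osd.5 0.5    # lower the crush weight so less data is mapped to this OSD
systemctl start ceph-osd@5           # attempt to start the down OSD daemon
journalctl -u ceph-osd@5 -e          # inspect the service log if it fails to start
less /var/log/ceph/ceph-osd.5.log    # the OSD's own log, at the default log path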