Clarification on PetaSAN_Online_Upgrade_Guide

neiltorda
99 Posts
July 27, 2023, 12:49 pm
I am planning to run the upgrade on a new small cluster that has a few iSCSI volumes shared out.
In the upgrade guide, the steps to go from 3.1.0 to the newest version are:
----------- I have added step numbers ----
1. To begin the upgrade, ensure the status of the cluster is OK, active/clean. For each node in the cluster, perform the following steps, one node at a time:
2. apt update
3. apt install ca-certificates
4. /opt/petasan/scripts/online-updates/update.sh
5. When ALL nodes are updated, run the following command: ceph osd require-osd-release quincy
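[Editor's note: for reference, the per-node sequence above can be sketched as a dry-run helper that only prints the commands in order; the helper function and the node names are illustrative, not part of the guide.]

```shell
# Dry-run sketch of the upgrade sequence: prints, per node, the commands
# from the guide plus the reboot step, without executing anything.
# The node names passed in below are examples only.
plan_upgrade() {
  for node in "$@"; do
    echo "# --- on $node ---"
    echo "apt update"
    echo "apt install ca-certificates"
    echo "/opt/petasan/scripts/online-updates/update.sh"
    echo "reboot   # needed if the update installed a new kernel"
  done
  echo "# once, after ALL nodes are updated and healthy:"
  echo "ceph osd require-osd-release quincy"
}

plan_upgrade psan1 psan2 psan3 psan4
```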
I am assuming there is an unwritten step between 4 and 5: reboot each node after running step 4. So step 5 is run AFTER all nodes have had step 4 run and have been rebooted.
Is this the case?
Thanks,
Neil
Last edited on July 27, 2023, 12:50 pm by neiltorda · #1

admin
2,974 Posts
July 27, 2023, 2:34 pm
A reboot is required if there was a kernel update and you need to run the new kernel; this is similar to online updates in most distros. We restart the needed services ourselves, so no reboot is needed unless there is a new kernel. We should probably automate the message at the end of the upgrade to recommend a reboot when needed.
3.2 has a new kernel, so to use the new kernel you should reboot.
Last edited on July 27, 2023, 2:42 pm by admin · #2

neiltorda
99 Posts
July 27, 2023, 3:19 pm
Great, thanks so much.
And step 5 above (ceph osd require-osd-release quincy): is that run on one node, or on all nodes?
Neil

neiltorda
99 Posts
July 27, 2023, 4:04 pm
I have performed the steps above on 3 of the 4 nodes in my cluster. After running the commands on node3, the system will not come out of HEALTH_WARN.
Output of ceph -s is:
Every 1.0s: ceph -s psan4: Thu Jul 27 11:59:13 2023
cluster:
id: c9-----------------2ebdfd8
health: HEALTH_WARN
Reduced data availability: 4097 pgs inactive
services:
mon: 3 daemons, quorum psan4,psan1,psan2 (age 37m)
mgr: psan4(active, starting, since 40m), standbys: psan1, psan2
mds: 3 up:standby
osd: 150 osds: 150 up (since 12m), 150 in (since 12m)
data:
pools: 3 pools, 4097 pgs
objects: 0 objects, 0 B
usage: 0 B used, 0 B / 0 B avail
pgs: 100.000% pgs unknown
4097 unknown
Currently psan4 is the active manager, and it is the one node I have not run the updates on yet.
Since the directions state to make sure HEALTH is OK before running the commands, I am not sure what I should do at this point.
I checked machines that have the iscsi disks mounted that are exported out from the petasan cluster, and the disks are mounted and can be accessed, but I am concerned about moving forward. What will happen if I run the updates on the final node if the HEALTH is not OK?
Thanks!
Neil

neiltorda
99 Posts
July 27, 2023, 4:35 pm
More things that are currently broken:
In the web interface, no data is displayed on the main screen.
iSCSI Disk list is empty:
https://em.wcu.edu/iscsiDisk.png
iSCSI Path Assignment screen is also empty:
https://em.wcu.edu/iscsipath.png
Last edited on July 27, 2023, 4:37 pm by neiltorda · #5

admin
2,974 Posts
July 27, 2023, 4:52 pm
I understand client I/O is OK, correct?
Can you stop the mgr service on node 4 until another node takes over the role, then restart it? What does the ceph status show?

neiltorda
99 Posts
July 27, 2023, 5:05 pm
What is the command to stop the mgr service? Would it be systemctl stop ceph-mgr@psan4.service?
When I do a systemctl status ceph-mgr, nothing returns, but there is a service called ceph-mgr@psan4.service.
Just want to make sure I am stopping the correct item.
Thanks
Neil

admin
2,974 Posts
July 27, 2023, 5:18 pm
Yes, the command is correct; run it on node 4 itself.
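[Editor's note: the reason plain `systemctl status ceph-mgr` shows nothing is that ceph-mgr runs as a templated systemd unit whose instance name is the hostname. A small illustrative helper (not part of PetaSAN) that builds the unit name:]

```shell
# ceph-mgr is a templated systemd unit (ceph-mgr@<hostname>.service),
# so the unit must be addressed with the instance name included.
mgr_unit() {
  echo "ceph-mgr@$1.service"
}

# Example usage on psan4 (shown as comments, not executed here):
#   systemctl stop  "$(mgr_unit psan4)"   # a standby mgr takes over
#   systemctl start "$(mgr_unit psan4)"   # psan4 rejoins as a standby
mgr_unit psan4
```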

neiltorda
99 Posts
July 27, 2023, 5:26 pm
I stopped the service on node4…
Same issues as reported above after node1 took over; notice psan4 is no longer listed as a mgr node.
cluster:
id: c9f--------------------dfd8c
health: HEALTH_WARN
Reduced data availability: 4097 pgs inactive
services:
mon: 3 daemons, quorum psan4,psan1,psan2 (age 2h)
mgr: psan1(active, starting, since 2m), standbys: psan2
osd: 150 osds: 150 up (since 95m), 150 in (since 95m)
data:
pools: 3 pools, 4097 pgs
objects: 0 objects, 0 B
usage: 0 B used, 0 B / 0 B avail
pgs: 100.000% pgs unknown
4097 unknown
I then restarted the mgr service on node4, ceph -s still reports the same thing, just that psan4 (4th node) is now a standby:
Every 1.0s: ceph -s psan4: Thu Jul 27 13:25:55 2023
cluster:
id: c9f0a-----------------fd8c
health: HEALTH_WARN
Reduced data availability: 4097 pgs inactive
services:
mon: 3 daemons, quorum psan4,psan1,psan2 (age 2h)
mgr: psan1(active, starting, since 6m), standbys: psan2, psan4
mds: 3 up:standby
osd: 150 osds: 150 up (since 99m), 150 in (since 99m)
data:
pools: 3 pools, 4097 pgs
objects: 0 objects, 0 B
usage: 0 B used, 0 B / 0 B avail
pgs: 100.000% pgs unknown
4097 unknown

admin
2,974 Posts
July 27, 2023, 6:21 pm
Is client I/O working?
What is the output of:
ceph versions
ceph osd dump | grep release
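[Editor's note: these two commands show whether every daemon is already running the new Ceph release and whether the require-osd-release flag has been set. As an illustration of what to look for, a sketch that extracts the flag from sample `ceph osd dump` output; the SAMPLE text is made up for the example.]

```shell
# Extract the require_osd_release value from `ceph osd dump` output.
# SAMPLE stands in for the real command output, which would come from:
#   ceph osd dump | grep release
SAMPLE="epoch 1234
flags sortbitwise,recovery_deletes,purged_snapdirs
require_osd_release quincy"

release=$(echo "$SAMPLE" | awk '/require_osd_release/ {print $2}')
echo "require_osd_release is: $release"
```

If the flag still reports the previous release after all nodes are upgraded, that would point at the final `ceph osd require-osd-release quincy` step not having been run yet.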