
Clarification on PetaSAN_Online_Upgrade_Guide


I am planning on running the upgrade on a new small cluster that has a few iSCSI volumes shared out.

In the upgrade guide, the steps to go from 3.1.0 to the newest version are:

----------- I have added step numbers -----------

1. To begin the upgrade, ensure the status of the cluster is OK, active/clean. For each node in the cluster, perform the following steps, one node at a time:
2. apt update
3. apt install ca-certificates
4. /opt/petasan/scripts/online-updates/update.sh
5. When ALL nodes are updated, run the following command: ceph osd require-osd-release quincy


I am assuming that there is an unwritten step between 4 and 5: reboot each node after running step 4. Step 5 is then run AFTER all nodes have completed step 4 and have been rebooted. In other words, roughly the per-node flow sketched below.
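A minimal sketch of what I have in mind (I am using ceph -s as the health check; the reboot is my assumption, it is not written in the guide):

# on each node, one node at a time
ceph -s                                         # confirm the cluster is OK / active+clean first
apt update
apt install ca-certificates
/opt/petasan/scripts/online-updates/update.sh
reboot                                          # assumed, unwritten step
# then, only once ALL nodes are updated (and rebooted):
ceph osd require-osd-release quincy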

Is this the case?

Thanks,
Neil

 

A reboot is required in case there was a kernel update and you need to run the new kernel; this is similar to online updates of most distros. We restart the needed services ourselves, so no reboot is needed unless you have a new kernel. We should probably automate the upgrade message at the end to recommend a reboot if needed.

3.2 has a new kernel, so to use the new kernel you should reboot.
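For example, one quick way to check whether you are already running the new kernel (a rough sketch, not from the guide):

uname -r            # kernel version currently running
ls /lib/modules/    # kernel versions installed on disk; if a newer one is listed, reboot to pick it up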

 

Great, thanks so much.

And step 5 above (ceph osd require-osd-release quincy): is that run on one node, or on all nodes?

Neil

I have performed the steps above on 3 of the 4 nodes in my cluster. After running the commands on node3, the system will not come out of HEALTH_WARN.

Output of ceph -s is:

Every 1.0s: ceph -s        psan4: Thu Jul 27 11:59:13 2023

  cluster:
    id:     c9-----------------2ebdfd8
    health: HEALTH_WARN
            Reduced data availability: 4097 pgs inactive

  services:
    mon: 3 daemons, quorum psan4,psan1,psan2 (age 37m)
    mgr: psan4(active, starting, since 40m), standbys: psan1, psan2
    mds:  3 up:standby
    osd: 150 osds: 150 up (since 12m), 150 in (since 12m)

  data:
    pools:   3 pools, 4097 pgs
    objects: 0 objects, 0 B
    usage:   0 B used, 0 B / 0 B avail
    pgs:     100.000% pgs unknown
             4097 unknown

 

Currently psan4 is the active manager, and it is the one node I have not run the updates on yet.

Since the directions state to make sure HEALTH is OK before running the commands, I am not sure what I should do at this point.

I checked the machines that mount the iSCSI disks exported from the PetaSAN cluster; the disks are still mounted and can be accessed, but I am concerned about moving forward. What will happen if I run the updates on the final node while HEALTH is not OK?

Thanks!

Neil

More things that are currently broken:

In the web interface, there is no data displayed on the main screen:

https://em.wcu.edu/webui.png

iSCSI Disk list is empty:
https://em.wcu.edu/iscsiDisk.png

iSCSI Path Assignment screen is also empty:
https://em.wcu.edu/iscsipath.png

I understand client I/O is OK, correct?

Can you stop the mgr service on node 4 until another node takes the role, then restart it? What does the ceph status show then?

What is the command to stop the mgr service? Would it be systemctl stop ceph-mgr@psan4.service?

When I do a systemctl status ceph-mgr, nothing returns, but there is a service called ceph-mgr@psan4.service.

Just want to make sure I am stopping the correct item.

Thanks

Neil

Yes, the command is correct; run it on node 4 itself. (The mgr runs as an instance of the ceph-mgr@ template unit, named after the host, which is why a plain systemctl status ceph-mgr shows nothing.)
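In other words, roughly this sequence on psan4:

systemctl stop ceph-mgr@psan4.service    # stop the mgr on node 4
watch ceph -s                            # wait until another mgr (e.g. psan1) shows as active
systemctl start ceph-mgr@psan4.service   # bring psan4 back; it should rejoin as a standby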

I stopped the service on node4. Same issues as reported above after node1 took over; notice psan4 is no longer listed as a mgr node.

  cluster:
    id:     c9f--------------------dfd8c
    health: HEALTH_WARN
            Reduced data availability: 4097 pgs inactive

  services:
    mon: 3 daemons, quorum psan4,psan1,psan2 (age 2h)
    mgr: psan1(active, starting, since 2m), standbys: psan2
    osd: 150 osds: 150 up (since 95m), 150 in (since 95m)

  data:
    pools:   3 pools, 4097 pgs
    objects: 0 objects, 0 B
    usage:   0 B used, 0 B / 0 B avail
    pgs:     100.000% pgs unknown
             4097 unknown

 

I then restarted the mgr service on node4; ceph -s still reports the same thing, except that psan4 (the 4th node) is now a standby:

 

Every 1.0s: ceph -s        psan4: Thu Jul 27 13:25:55 2023

  cluster:
    id:     c9f0a-----------------fd8c
    health: HEALTH_WARN
            Reduced data availability: 4097 pgs inactive

  services:
    mon: 3 daemons, quorum psan4,psan1,psan2 (age 2h)
    mgr: psan1(active, starting, since 6m), standbys: psan2, psan4
    mds:  3 up:standby
    osd: 150 osds: 150 up (since 99m), 150 in (since 99m)

  data:
    pools:   3 pools, 4097 pgs
    objects: 0 objects, 0 B
    usage:   0 B used, 0 B / 0 B avail
    pgs:     100.000% pgs unknown
             4097 unknown

Is client I/O working?

What is the output of:

ceph versions

ceph osd dump | grep release
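These should show, roughly, the release each daemon type is running and the cluster's current require_osd_release value:

ceph versions                  # per-daemon-type release summary
ceph osd dump | grep release   # the require_osd_release line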
