
Clarification on PetaSAN_Online_Upgrade_Guide


I am planning on running the upgrade on a new small cluster that has a few iSCSI volumes shared out.

In the upgrade guide, the steps to go from 3.1.0 to the newest version are:

----------- I have added step numbers -----------

1. To begin the upgrade, ensure the status of the cluster is OK, active/clean. For each node in the cluster, perform the following steps, one node at a time:
2. apt update
3. apt install ca-certificates
4. /opt/petasan/scripts/online-updates/update.sh
5. When ALL nodes are updated, run the following command: ceph osd require-osd-release quincy


I am assuming that there is an unwritten step between 4 and 5: reboot each node after running step 4. Step 5 is then run AFTER all nodes have completed step 4 and have been rebooted. In other words, roughly the per-node flow sketched below.
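A minimal sketch of what I have in mind (I am using ceph -s as the health check; the reboot is my assumption, it is not written in the guide):

# on each node, one node at a time
ceph -s                                         # confirm the cluster is OK / active+clean first
apt update
apt install ca-certificates
/opt/petasan/scripts/online-updates/update.sh
reboot                                          # assumed, unwritten step
# then, only once ALL nodes are updated (and rebooted):
ceph osd require-osd-release quincy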

Is this the case?

Thanks,
Neil

 

A reboot is required in case there was a kernel update and you need to run the new kernel; this is similar to online updates of most distros. We restart the needed services ourselves, so no reboot is needed unless you have a new kernel. We should probably automate the upgrade message at the end to recommend a reboot if needed.

3.2 has a new kernel, so to use the new kernel you should reboot.
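For example, one quick way to check whether you are already running the new kernel (a rough sketch, not from the guide):

uname -r            # kernel version currently running
ls /lib/modules/    # kernel versions installed on disk; if a newer one is listed, reboot to pick it up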

 

Great, thanks so much.

And step 5 above (ceph osd require-osd-release quincy): is that run on one node, or on all nodes?

Neil

I have performed the steps above on 3 of the 4 nodes in my cluster. After running the commands on node3, the system will not come out of HEALTH_WARN.

Output of ceph -s is:

Every 1.0s: ceph -s        psan4: Thu Jul 27 11:59:13 2023

  cluster:
    id:     c9-----------------2ebdfd8
    health: HEALTH_WARN
            Reduced data availability: 4097 pgs inactive

  services:
    mon: 3 daemons, quorum psan4,psan1,psan2 (age 37m)
    mgr: psan4(active, starting, since 40m), standbys: psan1, psan2
    mds:  3 up:standby
    osd: 150 osds: 150 up (since 12m), 150 in (since 12m)

  data:
    pools:   3 pools, 4097 pgs
    objects: 0 objects, 0 B
    usage:   0 B used, 0 B / 0 B avail
    pgs:     100.000% pgs unknown
             4097 unknown

 

Currently psan4 is the active manager, and it is the one node I have not run the updates on yet.

Since the directions state to make sure HEALTH is OK before running the commands, I am not sure what I should do at this point.

I checked the machines that mount the iSCSI disks exported from the PetaSAN cluster; the disks are still mounted and can be accessed, but I am concerned about moving forward. What will happen if I run the updates on the final node while HEALTH is not OK?

Thanks!

Neil

More things that are currently broken:

In the web interface, there is no data displayed on the main screen:

https://em.wcu.edu/webui.png

iSCSI Disk list is empty:
https://em.wcu.edu/iscsiDisk.png

iSCSI Path Assignment screen is also empty:
https://em.wcu.edu/iscsipath.png

I understand client I/O is OK, correct?

Can you stop the mgr service on node 4 until another node takes the role, then restart it? What does the ceph status show then?

What is the command to stop the mgr service? Would it be systemctl stop ceph-mgr@psan4.service?

When I do a systemctl status ceph-mgr, nothing returns, but there is a service called ceph-mgr@psan4.service.

Just want to make sure I am stopping the correct item.

Thanks

Neil

Yes, the command is correct; run it on node 4 itself. (The mgr runs as an instance of the ceph-mgr@ template unit, named after the host, which is why a plain systemctl status ceph-mgr shows nothing.)
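In other words, roughly this sequence on psan4:

systemctl stop ceph-mgr@psan4.service    # stop the mgr on node 4
watch ceph -s                            # wait until another mgr (e.g. psan1) shows as active
systemctl start ceph-mgr@psan4.service   # bring psan4 back; it should rejoin as a standby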

I stopped the service on node4. Same issues as reported above after node1 took over; notice psan4 is no longer listed as a mgr node.

  cluster:
    id:     c9f--------------------dfd8c
    health: HEALTH_WARN
            Reduced data availability: 4097 pgs inactive

  services:
    mon: 3 daemons, quorum psan4,psan1,psan2 (age 2h)
    mgr: psan1(active, starting, since 2m), standbys: psan2
    osd: 150 osds: 150 up (since 95m), 150 in (since 95m)

  data:
    pools:   3 pools, 4097 pgs
    objects: 0 objects, 0 B
    usage:   0 B used, 0 B / 0 B avail
    pgs:     100.000% pgs unknown
             4097 unknown

 

I then restarted the mgr service on node4; ceph -s still reports the same thing, except that psan4 (the 4th node) is now a standby:

 

Every 1.0s: ceph -s        psan4: Thu Jul 27 13:25:55 2023

  cluster:
    id:     c9f0a-----------------fd8c
    health: HEALTH_WARN
            Reduced data availability: 4097 pgs inactive

  services:
    mon: 3 daemons, quorum psan4,psan1,psan2 (age 2h)
    mgr: psan1(active, starting, since 6m), standbys: psan2, psan4
    mds:  3 up:standby
    osd: 150 osds: 150 up (since 99m), 150 in (since 99m)

  data:
    pools:   3 pools, 4097 pgs
    objects: 0 objects, 0 B
    usage:   0 B used, 0 B / 0 B avail
    pgs:     100.000% pgs unknown
             4097 unknown

Is client I/O working?

What is the output of:

ceph versions

ceph osd dump | grep release
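These should show, roughly, the release each daemon type is running and the cluster's current require_osd_release value:

ceph versions                  # per-daemon-type release summary
ceph osd dump | grep release   # the require_osd_release line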
