
CEPH Health Warning (MDS Error) following cluster powerdown

Hello,

After a power outage, the three nodes of our PetaSAN cluster (version 3.1.0) are back up. The iSCSI service is running again and everything seems fine, but the cluster is showing a health warning: "insufficient standby MDS daemons available." We have also lost the statistics graphs for all three nodes. It's worth noting that when power was restored, only Node #1 came back online automatically, while the other two nodes remained off (I assume due to fencing). I also had to manually clear old logs from the root filesystem (/) of Node #1 because it was full.
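Before collecting the outputs below, I freed up space roughly like this (the exact paths are from memory, so treat it as a sketch of what I ran rather than a verbatim record):

df -h /                                   # confirm the root filesystem is full
du -xh --max-depth=1 /var/log | sort -h   # find the biggest log directories
journalctl --vacuum-size=200M             # trim the systemd journal if it is the culprit

Here are some of the outputs I gathered for troubleshooting, all run from Node #1: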

ceph status

cluster:
id: XXXYYYZZZ
health: HEALTH_WARN
insufficient standby MDS daemons available

services:
mon: 3 daemons, quorum ceph-node3,ceph-node1,ceph-node2 (age 83m)
mgr: ceph-node2(active, since 100m), standbys: ceph-node3, ceph-node1
mds: cephfs:1 {0=ceph-node2=up:active}
osd: 18 osds: 18 up (since 83m), 18 in (since 100m)
rgw: 3 daemons active (ceph-node1, ceph-node2, ceph-node3)

task status:

data:
pools: 12 pools, 897 pgs
objects: 2.31M objects, 8.3 TiB
usage: 24 TiB used, 15 TiB / 38 TiB avail
pgs: 897 active+clean

io:
client: 130 KiB/s rd, 4.0 MiB/s wr, 4 op/s rd, 351 op/s wr
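Since the warning points at missing standby MDS daemons, the first thing I want to rule out is that the MDS services on the two fenced nodes simply never came back. I am assuming PetaSAN uses the stock ceph-mds systemd units here, so the unit names below may not be exact:

systemctl status ceph-mds@ceph-node1    # run on Node #1; ceph-node2 currently holds the active rank
systemctl status ceph-mds@ceph-node3    # run on Node #3
systemctl restart ceph-mds@ceph-node1   # restart if the unit is dead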

ceph fs status

cephfs - 0 clients
======
RANK STATE MDS ACTIVITY DNS INOS
0 active ceph-node2 Reqs: 0 /s 103 26
POOL TYPE USED AVAIL
cephfs_metadata metadata 4839k 3398G
cephfs_data data 63.6G 3398G
MDS version: ceph version 15.2.14 (cd3bb7e87a2f62c1b862ff3fd8b1eec13391a5be) octopus (stable)
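So "ceph fs status" confirms a single active MDS on ceph-node2 and no standby at all. If restarting the daemons doesn't bring a standby back, my understanding is that the expected standby count can be inspected and tuned with the standard Ceph commands (noting them here for completeness; I have not changed anything yet):

ceph fs dump | grep -i standby              # shows standby_count_wanted and any standby daemons
ceph fs set cephfs standby_count_wanted 1   # 1 is the default; 0 would merely silence the warning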

ctdb status

connect() failed, errno=2
Failed to connect to CTDB daemon (/var/run/ctdb/ctdbd.socket)
Failed to read nodes file "/etc/ctdb/nodes"
Is this node part of CTDB cluster?
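The missing /etc/ctdb/nodes file and the unreachable socket suggest ctdbd is no longer running at all on this node. For what it's worth, this is how I would check the service itself (assuming the stock ctdb unit name):

systemctl status ctdb
ls -l /etc/ctdb/nodes /var/run/ctdb/ctdbd.socket   # both paths taken from the errors above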

tail -50 /var/log/samba/log.ctdb

2024/03/28 06:03:37.966612 ctdb-recoverd[4587]: ../../ctdb/server/ctdb_recoverd.c:1110 Starting do_recovery
2024/03/28 06:03:37.966636 ctdb-recoverd[4587]: Attempting to take recovery lock (/opt/petasan/config/shared/ctdb/lockfile)
2024/03/28 06:03:37.973330 ctdbd[4484]: /usr/lib/x86_64-linux-gnu/ctdb/ctdb_mutex_fcntl_helper: Unable to open /opt/petasan/config/shared/ctdb/lockfile - (Transport endpoint is not connected)
2024/03/28 06:03:37.973347 ctdb-recoverd[4587]: Unable to take recover lock - unknown error
2024/03/28 06:03:37.973365 ctdb-recoverd[4587]: Banning this node
2024/03/28 06:03:37.973372 ctdb-recoverd[4587]: Banning node 1 for 300 seconds
2024/03/28 06:03:37.973395 ctdbd[4484]: Banning this node for 300 seconds
2024/03/28 06:03:37.973404 ctdbd[4484]: Making node INACTIVE
2024/03/28 06:03:37.973414 ctdbd[4484]: Dropping all public IP addresses
2024/03/28 06:03:37.973421 ctdbd[4484]: ../../ctdb/server/ctdb_takeover.c:1675 Released 0 public IPs
2024/03/28 06:03:37.973432 ctdbd[4484]: Freeze all: frozen
2024/03/28 06:03:37.973450 ctdb-recoverd[4587]: Unable to take recovery lock
2024/03/28 06:03:37.973459 ctdb-recoverd[4587]: Abort recovery, ban this node for 300 seconds
2024/03/28 06:03:37.973465 ctdb-recoverd[4587]: Banning node 1 for 300 seconds
2024/03/28 06:03:37.973480 ctdbd[4484]: Banning this node for 300 seconds
2024/03/28 06:03:38.967760 ctdbd[4484]: Freeze all: frozen
2024/03/28 06:08:37.973866 ctdbd[4484]: Banning timed out
2024/03/28 06:08:38.306077 ctdb-recoverd[4587]: ../../ctdb/server/ctdb_recoverd.c:1110 Starting do_recovery
2024/03/28 06:08:38.306105 ctdb-recoverd[4587]: Attempting to take recovery lock (/opt/petasan/config/shared/ctdb/lockfile)
2024/03/28 06:08:38.312777 ctdbd[4484]: /usr/lib/x86_64-linux-gnu/ctdb/ctdb_mutex_fcntl_helper: Unable to open /opt/petasan/config/shared/ctdb/lockfile - (Transport endpoint is not connected)
2024/03/28 06:08:38.312790 ctdb-recoverd[4587]: Unable to take recover lock - unknown error
2024/03/28 06:08:38.312809 ctdb-recoverd[4587]: Banning this node
2024/03/28 06:08:38.312816 ctdb-recoverd[4587]: Banning node 1 for 300 seconds
2024/03/28 06:08:38.312842 ctdbd[4484]: Banning this node for 300 seconds
2024/03/28 06:08:38.312852 ctdbd[4484]: Making node INACTIVE
2024/03/28 06:08:38.312864 ctdbd[4484]: Dropping all public IP addresses
2024/03/28 06:08:38.312872 ctdbd[4484]: ../../ctdb/server/ctdb_takeover.c:1675 Released 0 public IPs
2024/03/28 06:08:38.312883 ctdbd[4484]: [...]
2024/03/28 07:41:35.508456 ctdbd[4484]: Recovery has started
2024/03/28 07:41:36.334737 ctdb-eventd[4486]: Received signal 15
2024/03/28 07:41:36.334783 ctdb-eventd[4486]: Shutting down
2024/03/28 07:41:36.334805 ctdbd[4484]: Shutdown sequence complete, exiting.
2024/03/28 07:41:36.334851 ctdbd[4484]: CTDB daemon shutting down
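The repeating "Transport endpoint is not connected" on /opt/petasan/config/shared looks to me like a stale FUSE mount left over from the power loss, which would explain why CTDB can never take its recovery lock. If that is what happened, I believe the usual cleanup is a lazy unmount so the service can remount cleanly (sketch only; the path is taken from the log above):

umount -l /opt/petasan/config/shared   # detach the stale mount point
ls /opt/petasan/config/shared          # should now be an empty local directory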

tail -50 /var/log/samba/log.smbd

[2023/03/04 02:44:36.897891, 2] ../../source3/lib/dmallocmsg.c:78(register_dmalloc_msgs)
Registered MSG_REQ_DMALLOC_MARK and LOG_CHANGED
[2023/03/04 03:00:46.320795, 0] ../../source3/smbd/server.c:1784(main)
smbd version 4.13.17-Ubuntu started.
Copyright Andrew Tridgell and the Samba Team 1992-2020
[2023/03/04 03:00:46.508563, 2] ../../source3/lib/tallocmsg.c:84(register_msg_pool_usage)
Registered MSG_REQ_POOL_USAGE
[2023/03/04 03:00:46.508621, 2] ../../source3/lib/dmallocmsg.c:78(register_dmalloc_msgs)
Registered MSG_REQ_DMALLOC_MARK and LOG_CHANGED
[2023/03/04 03:01:23.673329, 0] ../../source3/smbd/server.c:1784(main)
smbd version 4.13.17-Ubuntu started.
Copyright Andrew Tridgell and the Samba Team 1992-2020
[2023/03/04 03:01:23.674658, 2] ../../source3/lib/tallocmsg.c:84(register_msg_pool_usage)
Registered MSG_REQ_POOL_USAGE
[2023/03/04 03:01:23.674689, 2] ../../source3/lib/dmallocmsg.c:78(register_dmalloc_msgs)
Registered MSG_REQ_DMALLOC_MARK and LOG_CHANGED
[2024/03/28 03:22:50.117132, 0] ../../source3/smbd/server.c:1784(main)
smbd version 4.13.17-Ubuntu started.
Copyright Andrew Tridgell and the Samba Team 1992-2020
[2024/03/28 03:22:50.122559, 2] ../../source3/lib/tallocmsg.c:84(register_msg_pool_usage)
Registered MSG_REQ_POOL_USAGE
[2024/03/28 03:22:50.122588, 2] ../../source3/lib/dmallocmsg.c:78(register_dmalloc_msgs)
Registered MSG_REQ_DMALLOC_MARK and LOG_CHANGED

"mount | grep mnt" and "mount | grep shared" don't show anything

PetaSAN logs from Node #1:

28/03/2024 09:16:44 ERROR CIFS init shared filesystem not mounted
28/03/2024 09:16:44 INFO CIFSService init action
28/03/2024 09:16:02 ERROR CTDBNodesFile get_node_ips exception
28/03/2024 09:15:45 ERROR CTDBNodesFile get_node_ips exception
28/03/2024 09:15:38 ERROR CTDBNodesFile get_node_ips exception
28/03/2024 09:15:25 ERROR CTDBNodesFile get_node_ips exception
28/03/2024 09:15:18 ERROR CTDBNodesFile get_node_ips exception
28/03/2024 09:15:07 ERROR CTDBNodesFile get_node_ips exception
28/03/2024 09:14:56 ERROR CTDBNodesFile get_node_ips exception
28/03/2024 09:14:47 ERROR CTDBNodesFile get_node_ips exception
28/03/2024 09:14:40 ERROR CTDBNodesFile get_node_ips exception
28/03/2024 09:14:26 ERROR CTDBNodesFile get_node_ips exception
28/03/2024 09:14:20 ERROR Error running cmd : gluster volume status gfs-vol
28/03/2024 09:14:20 INFO CIFSService init action
28/03/2024 09:13:49 ERROR CIFS init shared filesystem not mounted
28/03/2024 09:13:49 INFO CIFSService init action
28/03/2024 09:13:19 ERROR CIFS init shared filesystem not mounted
28/03/2024 09:13:19 INFO CIFSService init action
28/03/2024 09:12:48 ERROR CIFS init shared filesystem not mounted
28/03/2024 09:12:48 INFO CIFSService init action
28/03/2024 09:12:18 ERROR CIFS init shared filesystem not mounted
28/03/2024 09:12:18 INFO CIFSService init action
28/03/2024 09:11:47 ERROR CIFS init shared filesystem not mounted
28/03/2024 09:11:47 INFO CIFSService init action
28/03/2024 09:11:05 ERROR CTDBNodesFile get_node_ips exception
28/03/2024 09:10:51 ERROR CTDBNodesFile get_node_ips exception
28/03/2024 09:10:44 ERROR CTDBNodesFile get_node_ips exception
28/03/2024 09:10:31 ERROR CTDBNodesFile get_node_ips exception
28/03/2024 09:10:21 ERROR CTDBNodesFile get_node_ips exception
28/03/2024 09:10:10 ERROR CTDBNodesFile get_node_ips exception
28/03/2024 09:09:59 ERROR CTDBNodesFile get_node_ips exception
28/03/2024 09:09:50 ERROR CTDBNodesFile get_node_ips exception
28/03/2024 09:09:43 ERROR CTDBNodesFile get_node_ips exception
28/03/2024 09:09:29 ERROR CTDBNodesFile get_node_ips exception
28/03/2024 09:09:24 ERROR Error running cmd : gluster volume status gfs-vol
28/03/2024 09:09:24 INFO CIFSService init action
28/03/2024 09:08:53 ERROR CIFS init shared filesystem not mounted
28/03/2024 09:08:53 INFO CIFSService init action
28/03/2024 09:08:23 ERROR CIFS init shared filesystem not mounted
28/03/2024 09:08:23 INFO CIFSService init action
28/03/2024 09:07:52 ERROR CIFS init shared filesystem not mounted
28/03/2024 09:07:52 INFO CIFSService init action
28/03/2024 09:07:22 ERROR CIFS init shared filesystem not mounted
28/03/2024 09:07:22 INFO CIFSService init action
28/03/2024 09:06:39 ERROR CTDBNodesFile get_node_ips exception
28/03/2024 09:06:23 ERROR CTDBNodesFile get_node_ips exception
28/03/2024 09:06:16 ERROR CTDBNodesFile get_node_ips exception
28/03/2024 09:06:03 ERROR CTDBNodesFile get_node_ips exception
28/03/2024 09:05:56 ERROR CTDBNodesFile get_node_ips exception
28/03/2024 09:05:45 ERROR CTDBNodesFile get_node_ips exception
28/03/2024 09:05:34 ERROR CTDBNodesFile get_node_ips exception
28/03/2024 09:05:25 ERROR CTDBNodesFile get_node_ips exception
28/03/2024 09:05:18 ERROR CTDBNodesFile get_node_ips exception
28/03/2024 09:05:04 ERROR CTDBNodesFile get_node_ips exception
28/03/2024 09:04:58 ERROR Error running cmd : gluster volume status gfs-vol
28/03/2024 09:04:58 INFO CIFSService init action
28/03/2024 09:04:27 ERROR CIFS init shared filesystem not mounted
28/03/2024 09:04:27 INFO CIFSService init action
28/03/2024 09:03:57 ERROR CIFS init shared filesystem not mounted
28/03/2024 09:03:57 INFO CIFSService init action
28/03/2024 09:03:26 ERROR CIFS init shared filesystem not mounted
28/03/2024 09:03:26 INFO CIFSService init action
28/03/2024 09:02:56 ERROR CIFS init shared filesystem not mounted
28/03/2024 09:02:56 INFO CIFSService init action
28/03/2024 09:02:25 ERROR CIFS init shared filesystem not mounted
28/03/2024 09:02:25 INFO CIFSService init action
28/03/2024 09:01:43 ERROR CTDBNodesFile get_node_ips exception
28/03/2024 09:01:29 ERROR CTDBNodesFile get_node_ips exception
28/03/2024 09:01:22 ERROR CTDBNodesFile get_node_ips exception
28/03/2024 09:01:09 ERROR CTDBNodesFile get_node_ips exception
28/03/2024 09:00:59 ERROR CTDBNodesFile get_node_ips exception
28/03/2024 09:00:48 ERROR CTDBNodesFile get_node_ips exception
28/03/2024 09:00:37 ERROR CTDBNodesFile get_node_ips exception
28/03/2024 09:00:28 ERROR CTDBNodesFile get_node_ips exception
28/03/2024 09:00:21 ERROR CTDBNodesFile get_node_ips exception
28/03/2024 09:00:07 ERROR CTDBNodesFile get_node_ips exception
28/03/2024 09:00:02 ERROR Error running cmd : gluster volume status gfs-vol
28/03/2024 09:00:02 INFO CIFSService init action
28/03/2024 08:59:31 ERROR CIFS init shared filesystem not mounted
28/03/2024 08:59:31 INFO CIFSService init action
28/03/2024 08:59:01 ERROR CIFS init shared filesystem not mounted
28/03/2024 08:59:01 INFO CIFSService init action
28/03/2024 08:58:30 ERROR CIFS init shared filesystem not mounted
28/03/2024 08:58:30 INFO CIFSService init action
28/03/2024 08:58:00 ERROR CIFS init shared filesystem not mounted
28/03/2024 08:58:00 INFO CIFSService init action
28/03/2024 08:57:17 ERROR CTDBNodesFile get_node_ips exception
28/03/2024 08:57:01 ERROR CTDBNodesFile get_node_ips exception
28/03/2024 08:56:54 ERROR CTDBNodesFile get_node_ips exception
28/03/2024 08:56:41 ERROR CTDBNodesFile get_node_ips exception
28/03/2024 08:56:34 ERROR CTDBNodesFile get_node_ips exception
28/03/2024 08:56:23 ERROR CTDBNodesFile get_node_ips exception
28/03/2024 08:56:12 ERROR CTDBNodesFile get_node_ips exception
28/03/2024 08:56:03 ERROR CTDBNodesFile get_node_ips exception
28/03/2024 08:55:56 ERROR CTDBNodesFile get_node_ips exception
28/03/2024 08:55:42 ERROR CTDBNodesFile get_node_ips exception
28/03/2024 08:55:37 ERROR Error running cmd : gluster volume status gfs-vol
28/03/2024 08:55:37 INFO CIFSService init action
28/03/2024 08:55:06 ERROR CIFS init shared filesystem not mounted
28/03/2024 08:55:06 INFO CIFSService init action
28/03/2024 08:54:36 ERROR CIFS init shared filesystem not mounted
28/03/2024 08:54:36 INFO CIFSService init action
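My reading of the Node #1 log is that every "CTDBNodesFile get_node_ips exception" is just a downstream symptom of the shared filesystem being unavailable, since PetaSAN appears to keep the CTDB configuration on the shared mount (the recovery lockfile in the ctdb log above lives there). Once the share is back, I plan to verify that chain like this (the nodes-file path is my assumption, not something I have confirmed):

ls -l /etc/ctdb/nodes                       # check whether this points into the shared config
cat /opt/petasan/config/shared/ctdb/nodes   # hypothetical path; should list one backend IP per line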

PetaSAN logs from Node #2:

28/03/2024 09:17:33 INFO CIFS check_health ctdb not active, restarting.
28/03/2024 09:16:26 INFO CIFS check_health ctdb not active, restarting.
28/03/2024 09:15:20 INFO CIFS check_health ctdb not active, restarting.
28/03/2024 09:14:11 INFO CIFS check_health ctdb not active, restarting.
28/03/2024 09:13:05 INFO CIFS check_health ctdb not active, restarting.
28/03/2024 09:11:56 INFO CIFS check_health ctdb not active, restarting.
28/03/2024 09:10:49 INFO CIFS check_health ctdb not active, restarting.
28/03/2024 09:09:41 INFO CIFS check_health ctdb not active, restarting.
28/03/2024 09:08:35 INFO CIFS check_health ctdb not active, restarting.
28/03/2024 09:07:28 INFO CIFS check_health ctdb not active, restarting.
28/03/2024 09:06:20 INFO CIFS check_health ctdb not active, restarting.
28/03/2024 09:05:11 INFO CIFS check_health ctdb not active, restarting.
28/03/2024 09:04:02 INFO CIFS check_health ctdb not active, restarting.
28/03/2024 09:02:56 INFO CIFS check_health ctdb not active, restarting.
28/03/2024 09:01:47 INFO CIFS check_health ctdb not active, restarting.
28/03/2024 09:00:39 INFO CIFS check_health ctdb not active, restarting.
28/03/2024 08:59:29 INFO CIFS check_health ctdb not active, restarting.
28/03/2024 08:58:21 INFO CIFS check_health ctdb not active, restarting.
28/03/2024 08:57:13 INFO CIFS check_health ctdb not active, restarting.
28/03/2024 08:56:08 INFO CIFS check_health ctdb not active, restarting.
28/03/2024 08:55:01 INFO CIFS check_health ctdb not active, restarting.

PetaSAN logs from Node #3:

28/03/2024 09:19:25 INFO CIFS check_health ctdb not active, restarting.
28/03/2024 09:18:16 INFO CIFS check_health ctdb not active, restarting.
28/03/2024 09:17:07 INFO CIFS check_health ctdb not active, restarting.
28/03/2024 09:16:01 INFO CIFS check_health ctdb not active, restarting.
28/03/2024 09:14:54 INFO CIFS check_health ctdb not active, restarting.
28/03/2024 09:13:45 INFO CIFS check_health ctdb not active, restarting.
28/03/2024 09:12:37 INFO CIFS check_health ctdb not active, restarting.
28/03/2024 09:11:30 INFO CIFS check_health ctdb not active, restarting.
28/03/2024 09:10:22 INFO CIFS check_health ctdb not active, restarting.
28/03/2024 09:09:14 INFO CIFS check_health ctdb not active, restarting.
28/03/2024 09:08:07 INFO CIFS check_health ctdb not active, restarting.
28/03/2024 09:07:00 INFO CIFS check_health ctdb not active, restarting.
28/03/2024 09:05:53 INFO CIFS check_health ctdb not active, restarting.
28/03/2024 09:04:46 INFO CIFS check_health ctdb not active, restarting.
28/03/2024 09:03:39 INFO CIFS check_health ctdb not active, restarting.
28/03/2024 09:02:32 INFO CIFS check_health ctdb not active, restarting.
28/03/2024 09:01:25 INFO CIFS check_health ctdb not active, restarting.
28/03/2024 09:00:19 INFO CIFS check_health ctdb not active, restarting.
28/03/2024 08:59:10 INFO CIFS check_health ctdb not active, restarting.
28/03/2024 08:58:04 INFO CIFS check_health ctdb not active, restarting.
28/03/2024 08:56:56 INFO CIFS check_health ctdb not active, restarting.
28/03/2024 08:55:47 INFO CIFS check_health ctdb not active, restarting.
28/03/2024 08:54:40 INFO CIFS check_health ctdb not active, restarting.
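Finally, regarding the lost statistics graphs: since the root filesystem on Node #1 filled up, I suspect the monitoring stack either died or lost its data, so that is next on my list. The service names below are my guess at what PetaSAN's monitoring uses, so they may well be wrong:

systemctl status grafana-server
systemctl status carbon-cache

Does this look like the right direction, or is there a recommended PetaSAN procedure for recovering the shared filesystem and CTDB after an unclean shutdown? Any pointers would be appreciated.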