NFS Share Freeze (~3 Minutes) During MDS Failover

Hi Team,
I am currently testing failover behavior in our Ceph cluster and observed that the NFS share becomes unresponsive on the client side for approximately 3 minutes during failover. We would appreciate your guidance on whether this behavior is expected or whether additional tuning or configuration is required.

Cluster Setup

  • CephFS is used as the backend for NFS exports.

Current MDS Architecture

  • 1 Active MDS – running on node2
  • 1 Standby-Replay MDS – running on node3
  • 1 Standby MDS – running on node1

Observed Behavior

During testing, when the active MDS node (node2) is powered off, the following occurs:

  1. The standby-replay MDS begins the failover process.
  2. CephFS transitions through recovery states (reconnect / clientreplay).
  3. During this period, NFS shares become inaccessible or hang.
  4. The total disruption lasts around 3 minutes before services resume.
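
To pin down how long the hang lasts from the client side, a small probe loop on one NFS client can log a timestamped result every second during the test. This is only a minimal sketch: the helper names `probe_once` and `watch_mount`, the 2-second limit, and the mount point `/mnt/nfs` are assumptions for illustration, not part of the original setup.

```shell
#!/usr/bin/env bash
# Probe a path once: print "ok" if stat returns within the limit, else "fail".
# The 2-second default limit is an illustrative assumption.
# Note: cached attributes on the NFS client may mask a short stall.
probe_once() {
    local path="$1" limit="${2:-2}"
    # SIGKILL is used so a process blocked on a hard NFS mount can still be stopped.
    if timeout -s KILL "$limit" stat "$path" >/dev/null 2>&1; then
        echo ok
    else
        echo fail
    fi
}

# Log one timestamped result per second until interrupted (Ctrl-C).
# Start this on the NFS client before powering off the active MDS node.
watch_mount() {
    local path="$1"
    while true; do
        printf '%s %s\n' "$(date +%T)" "$(probe_once "$path")"
        sleep 1
    done
}

# Usage (hypothetical mount point):
#   watch_mount /mnt/nfs
```

The first and last "fail" timestamps give the client-visible outage, which can then be compared against the MDS state changes reported by the cluster.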

Expected Behavior

Based on the documentation and our understanding of standby-replay MDS, we expected failover to complete within a few seconds, with minimal interruption to NFS clients.

Additional Details

  • Ceph clients: ~3
  • Metadata pool and data pool are healthy.
  • No degraded PGs observed during the failover test.
  • NFS is exported via the NFS-Ganesha service.

Please suggest a resolution for this issue.

No, it is not normal.

Have you made any config/settings changes?

What type of hardware do you have: bare metal or virtual, SSD or HDD, how much memory?

The following tuning parameters were applied to reduce failover time:

Time the cluster waits before declaring an MDS daemon dead

ceph config set mds mds_beacon_grace 15

Reduce MDS reconnect timeout

ceph config set mds mds_reconnect_timeout 20

Maximum time a client session can remain inactive before MDS marks it stale

ceph config set mds session_timeout 30

Automatically close inactive client sessions

ceph config set mds session_autoclose 120

ceph fs set cephfs session_timeout 30

ceph fs set cephfs session_autoclose 120
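
As a rough sanity check on the values above, the two waits that bound failure detection and the client reconnect phase can be added up and compared with the observed outage. This is a sketch under assumptions: the helper names `timeout_budget` and `show_applied` are hypothetical, the sum ignores journal replay, rejoin, and anything on the NFS-Ganesha side, and `show_applied` needs a working `ceph` CLI with the file system named `cephfs` as in the commands above.

```shell
#!/usr/bin/env bash
# Add the two configured waits: beacon grace (failure detection) and
# reconnect timeout (maximum wait for clients to reconnect).
# Journal replay and rejoin time are NOT included in this figure.
timeout_budget() {
    local beacon_grace="$1" reconnect_timeout="$2"
    echo $(( beacon_grace + reconnect_timeout ))
}

# Print the values the cluster actually holds, to confirm the settings took effect.
# Not called automatically; it requires access to the cluster.
show_applied() {
    ceph config get mds mds_beacon_grace
    ceph config get mds mds_reconnect_timeout
    # session_timeout and session_autoclose are per-file-system settings
    # and show up in the file system's MDS map dump.
    ceph fs get cephfs | grep -E 'session_(timeout|autoclose)'
}

# Usage:
#   timeout_budget 15 20    # beacon grace 15 s + reconnect timeout 20 s
#   show_applied
```

With the values from this post the sum is 35 seconds, well short of the roughly 3 minutes observed, which suggests that most of the delay may come from phases these two settings do not cover.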


The setup is on test VMs.

I recommend you test with real hardware and use SSDs for the metadata pool.