NFS Share Freeze (~3 Minutes) During MDS Failover

haris.joshy
3 Posts
March 12, 2026, 1:21 pm
Hi Team,
I am currently testing failover behavior in our cluster and observed an issue where the NFS share becomes unresponsive on the client side for approximately 3 minutes during failover. We would appreciate your guidance on whether this behavior is expected or if additional tuning/configuration is required.
Cluster Setup
- CephFS is used as the backend for NFS exports.
Current MDS Architecture
- 1 Active MDS – running on node2
- 1 Standby-Replay MDS – running on node3
- 1 Standby MDS – running on node1
Observed Behavior
During testing, when the active MDS node (node2) is powered off, the following occurs:
- The standby-replay MDS begins the failover process.
- CephFS transitions through the recovery states (reconnect / clientreplay).
- During this period, NFS shares become inaccessible or hang.
- The total disruption lasts around 3 minutes before services resume.
Expected Behavior
Based on the documentation and our understanding of standby-replay MDS, we expected failover to complete within a few seconds, with minimal interruption to NFS clients.
Additional Details
- Number of clients: ~3
- Metadata pool and data pool are healthy.
- No degraded PGs observed during the failover test.
- NFS is exported via the NFS-Ganesha service.
Please suggest a resolution.
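To see which recovery phase accounts for the ~3 minutes, it can help to timestamp every MDS state change during the test. Below is a minimal sketch, assuming a POSIX shell on a node where the ceph CLI works. `watch_changes` is a hypothetical helper written for this post, not part of Ceph.

```shell
#!/bin/sh
# Print a timestamped line each time the output of a command changes.
# Usage: watch_changes INTERVAL_SECONDS MAX_POLLS COMMAND [ARGS...]
watch_changes() {
    interval=$1
    max_polls=$2
    shift 2
    last=""
    i=0
    while [ "$i" -lt "$max_polls" ]; do
        # Capture the command output; if the command fails, log that instead
        current=$("$@" 2>&1) || current="command failed: $current"
        if [ "$current" != "$last" ]; then
            printf '%s %s\n' "$(date '+%H:%M:%S')" "$current"
            last=$current
        fi
        i=$((i + 1))
        sleep "$interval"
    done
}
```

For example, `watch_changes 1 600 ceph mds stat` started just before powering off node2 logs each change in the MDS map summary for about ten minutes. Comparing those timestamps with the moment the NFS mount responds again should show whether the delay sits in MDS recovery or on the NFS side.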

admin
3,073 Posts
March 12, 2026, 3:02 pm
No, it is not normal.
Have you made any config/settings changes?
What type of hardware do you have: bare metal/virtual, SSD/HDD, memory?

haris.joshy
3 Posts
March 13, 2026, 4:52 am
The following tuning parameters were applied to reduce failover time:
Time the cluster waits before declaring an MDS daemon dead
ceph config set mds mds_beacon_grace 15
Reduce MDS reconnect timeout
ceph config set mds mds_reconnect_timeout 20
Maximum time a client session can remain inactive before the MDS marks it stale
ceph config set mds session_timeout 30
Automatically close inactive client sessions
ceph config set mds session_autoclose 120
ceph fs set cephfs session_timeout 30
ceph fs set cephfs session_autoclose 120
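Before rerunning the failover test, it may be worth reading the values back to confirm they took effect. This is a rough sketch: `show_failover_settings` is a hypothetical helper, and the check of the `mon` section rests on the assumption that the monitors also evaluate `mds_beacon_grace` when deciding whether to replace an MDS.

```shell
#!/bin/sh
# Read back the failover-related settings for one file system.
# Usage: show_failover_settings FS_NAME
show_failover_settings() {
    fs_name=$1
    # Assumption: the monitors also consult mds_beacon_grace, so show both sections
    for who in mds mon; do
        printf '%s mds_beacon_grace = %s\n' "$who" \
            "$(ceph config get "$who" mds_beacon_grace)"
    done
    printf 'mds mds_reconnect_timeout = %s\n' \
        "$(ceph config get mds mds_reconnect_timeout)"
    # session_timeout and session_autoclose are listed in the file system map
    ceph fs get "$fs_name" | grep -E 'session_(timeout|autoclose)'
}
```

Running `show_failover_settings cephfs` prints the values the cluster currently reports, which can then be compared against the commands above.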

haris.joshy
3 Posts
March 13, 2026, 4:52 am
The setup is on test VMs.

admin
3,073 Posts
March 13, 2026, 9:06 pm
I recommend you test with real hardware and use SSDs for the metadata pool.
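On a plain Ceph cluster, one way to pin a pool to SSDs is a CRUSH rule restricted to the `ssd` device class. The sketch below assumes the OSDs already report that class, the CRUSH root is `default`, and the failure domain is `host`. The helper name, the rule name and the pool name are placeholders, not values from this thread.

```shell
#!/bin/sh
# Create a replicated CRUSH rule limited to SSD OSDs and assign it to a pool.
# Usage: move_pool_to_ssd POOL_NAME RULE_NAME
move_pool_to_ssd() {
    pool=$1
    rule=$2
    # Root "default" and failure domain "host" are assumptions about the CRUSH map
    ceph osd crush rule create-replicated "$rule" default host ssd &&
    # Switching the rule makes Ceph move the pool's data onto matching OSDs
    ceph osd pool set "$pool" crush_rule "$rule"
}
```

A call would look like `move_pool_to_ssd cephfs_metadata ssd_rule`, with the real metadata pool name taken from `ceph fs ls`.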