Forums - PetaSAN

NFS Share Freeze (~3 Minutes) During MDS Failover

haris.joshy
3 Posts

March 12, 2026, 1:21 pm
Hi Team,

I am currently testing failover behavior in our cluster and observed an issue where the NFS share becomes unresponsive on the client side for approximately 3 minutes during failover. We would appreciate your guidance on whether this behavior is expected or if additional tuning/configuration is required.

Cluster Setup

CephFS is used as the backend for NFS exports.

Current MDS Architecture

1 Active MDS – running on node2

1 Standby-Replay MDS – running on node3

1 Standby MDS – running on node1
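To line up the role layout above with what happens during a failover test, one option is to log the MDS map whenever it changes. The following is a minimal sketch, not something taken from the thread: it relies on the standard `ceph mds stat` command, and the function name `log_mds_states` and its default count and interval are made up for illustration.

```shell
#!/bin/sh
# Print a UNIX timestamp and the output of `ceph mds stat` each time
# that output changes. Arguments (both illustrative defaults):
#   $1 = number of polls, $2 = seconds between polls
log_mds_states() {
    count=${1:-600}
    interval=${2:-1}
    prev=""
    i=0
    while [ "$i" -lt "$count" ]; do
        # Keep going if the command fails (for example while a monitor
        # is unreachable) and record the failure text instead.
        cur=$(ceph mds stat 2>&1) || cur="error: $cur"
        if [ "$cur" != "$prev" ]; then
            echo "$(date +%s) $cur"
            prev=$cur
        fi
        i=$((i + 1))
        if [ "$i" -lt "$count" ]; then
            sleep "$interval"
        fi
    done
}
```

Running `log_mds_states 600 1` on a surviving node just before powering off the active MDS node should give a timestamped line for each state change, which can then be compared with the time the NFS clients stall.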

Observed Behavior

During testing, when the active MDS node (node2) is powered off, the following occurs:

The standby-replay MDS begins the failover process.

CephFS transitions through recovery states (reconnect / clientreplay).

During this period, NFS shares become inaccessible or hang.

The total disruption lasts around 3 minutes before services resume.
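The length of the freeze described above can be measured from an NFS client rather than estimated. Below is a rough sketch under a few assumptions: GNU `timeout` is available on the client, the probed path lives on the NFS mount, and the helper names `probe_once` and `watch_path` are hypothetical. Note that `stat` may sometimes be answered from the client's attribute cache, so a probe that lists or writes could be more sensitive.

```shell
#!/bin/sh
# Probe a path once. Prints "ok" if stat succeeds within the limit,
# "hang" if it had to be killed by timeout, "error" for anything else.
#   $1 = path to probe, $2 = time limit in seconds (default 2)
probe_once() {
    path=$1
    limit=${2:-2}
    rc=0
    timeout "$limit" stat "$path" >/dev/null 2>&1 || rc=$?
    case $rc in
        0)   echo ok ;;
        124) echo hang ;;    # 124 is timeout's "timed out" exit status
        *)   echo error ;;
    esac
}

# Probe a path repeatedly and print a timestamped line on each state change.
#   $1 = path, $2 = seconds between probes, $3 = number of probes
watch_path() {
    target=$1
    interval=${2:-1}
    count=${3:-300}
    prev=""
    i=0
    while [ "$i" -lt "$count" ]; do
        state=$(probe_once "$target")
        if [ "$state" != "$prev" ]; then
            echo "$(date +%s) $state"
            prev=$state
        fi
        i=$((i + 1))
        if [ "$i" -lt "$count" ]; then
            sleep "$interval"
        fi
    done
}
```

For example, `watch_path /mnt/nfs/testfile 1 600` (with `/mnt/nfs/testfile` standing in for a real file on the share) prints one line when the share stops responding and another when it recovers; the difference between the two timestamps is the client-visible freeze.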

Expected Behavior

Based on the documentation and our understanding of standby-replay MDS, we expected failover to complete within a few seconds, with minimal interruption to NFS clients.

Additional Details

clients: ~3

Metadata pool and data pool are healthy.

No degraded PGs observed during the failover test.

NFS is exported via the NFS-Ganesha service.

Please suggest a resolution.


#1

admin
3,082 Posts

March 12, 2026, 3:02 pm
No, it is not normal.

Have you done any config/settings changes?

What type of hardware do you have: bare metal or virtual, SSD or HDD, how much memory?


#2

haris.joshy
3 Posts

March 13, 2026, 4:52 am
The following tuning parameters were applied to reduce failover time:

Time the cluster waits before declaring an MDS daemon dead

ceph config set mds mds_beacon_grace 15

Reduce MDS reconnect timeout

ceph config set mds mds_reconnect_timeout 20

Maximum time a client session can remain inactive before MDS marks it stale

ceph config set mds session_timeout 30

Automatically close inactive client sessions

ceph config set mds session_autoclose 120

ceph fs set cephfs session_timeout 30

ceph fs set cephfs session_autoclose 120
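Before drawing conclusions from a failover test, it may be worth reading the settings back to confirm they took effect. The sketch below assumes the file system is named `cephfs`, as in the commands above, and uses the standard `ceph config get` and `ceph fs get` commands; the function name `show_mds_tuning` is hypothetical.

```shell
#!/bin/sh
# Print the current values of the options changed in the post above.
#   $1 = file system name (default "cephfs", as used in the thread)
show_mds_tuning() {
    fs_name=${1:-cephfs}
    # Options set through the central config database.
    for opt in mds_beacon_grace mds_reconnect_timeout; do
        printf '%s = %s\n' "$opt" "$(ceph config get mds "$opt")"
    done
    # session_timeout and session_autoclose are per-file-system settings,
    # so they are read from the file system map instead.
    ceph fs get "$fs_name" | grep -E 'session_(timeout|autoclose)'
}
```

Calling `show_mds_tuning` on a node with admin access prints one line per setting. With the values posted, mds_beacon_grace (15 s) plus mds_reconnect_timeout (20 s) comes to 35 seconds, so if those two were the only waits involved they would not account for a 3-minute freeze by themselves.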


#3

haris.joshy
3 Posts

March 13, 2026, 4:52 am
The setup is on test VMs.


#4

admin
3,082 Posts

March 13, 2026, 9:06 pm
I recommend you test on real hardware and use SSDs for the metadata pool.


#5
