Incident – Intermittent NFS Storage Instability

All times are shown in UTC+01:00.

Server cloud SSD

Start: Dec 15, 2025 at 09:30 (UTC+01:00) | End: Dec 15, 2025 at 10:47 (UTC+01:00)

Resolved

Description: Intermittent instability was detected on the san12 storage infrastructure, causing severe latency and temporary freezes when accessing data over NFS. The issue occurred at random: storage access was sometimes normal and at other times stalled for several minutes.

Root cause: The incident was caused by abnormal behavior of a single disk within a ZFS mirrored vdev. This disk exhibited extreme I/O latency, which temporarily blocked storage operations at the server level and prevented timely NFS responses to client requests.

Although the ZFS pool remained online and data integrity was preserved, these I/O stalls caused visible service disruptions for systems relying on NFS storage.
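
As an illustration of how a single stalling member of a mirror can be singled out, the following minimal sketch compares each disk's median latency with that of its peers. It is not the tooling used during this incident: the function name `find_slow_disks`, the thresholds, the disk names, and the sample values are all hypothetical.

```python
from statistics import median


def find_slow_disks(latencies_ms, ratio=10.0, floor_ms=50.0):
    """Return disks whose median latency is far above that of their peers.

    latencies_ms maps a disk name to a list of recent I/O latency samples (ms).
    A disk is flagged when its median is at least floor_ms AND at least
    `ratio` times the median of the other disks' medians.
    Both thresholds are illustrative assumptions, not tuned values.
    """
    medians = {
        disk: median(samples)
        for disk, samples in latencies_ms.items()
        if samples  # skip disks with no samples
    }
    slow = []
    for disk, value in medians.items():
        peers = [m for d, m in medians.items() if d != disk]
        if not peers:
            continue  # nothing to compare against
        baseline = max(median(peers), 0.001)  # avoid a zero baseline
        if value >= floor_ms and value >= ratio * baseline:
            slow.append(disk)
    return sorted(slow)


# Hypothetical samples for a two-way mirror: one healthy disk, one stalling.
samples = {
    "disk-a": [2.1, 1.8, 2.4, 2.0],
    "disk-b": [2.2, 950.0, 4100.0, 3800.0],
}
print(find_slow_disks(samples))  # ['disk-b']
```

Comparing against peers rather than a fixed limit alone keeps a uniformly busy pool from flagging every disk at once.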

Corrective actions:

Identification of the faulty disk responsible for the I/O stalls

Preventive offlining of the affected disk (service continuity ensured by ZFS redundancy)

Immediate stabilization of NFS access

Enhanced monitoring of the storage subsystem

Planned replacement of the faulty disk

Current status: All services are operational and stable. No data loss occurred thanks to the redundant storage configuration.

Next steps:

Permanent replacement of the affected disk

Scheduled reboot of the storage server to apply NFS performance optimizations

Additional monitoring to detect early signs of storage degradation
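
One possible shape for the additional monitoring is a check that raises an alert only when latency stays high across several recent samples, so that a single slow request does not trigger it. The sketch below is an assumption about how such a check could look, not a description of the monitoring actually deployed: the class name `LatencyWatch`, the 500 ms threshold, and the window sizes are hypothetical.

```python
from collections import deque


class LatencyWatch:
    """Alert when enough of the most recent samples exceed a threshold."""

    def __init__(self, threshold_ms=500.0, window=5, min_breaches=3):
        self.threshold_ms = threshold_ms
        self.min_breaches = min_breaches
        # Keep only the last `window` breach flags (True = over threshold).
        self.recent = deque(maxlen=window)

    def observe(self, latency_ms):
        """Record one sample; return True while the alert condition holds."""
        self.recent.append(latency_ms > self.threshold_ms)
        return sum(self.recent) >= self.min_breaches


# Hypothetical latency readings in milliseconds.
watch = LatencyWatch()
readings = [3.0, 4.0, 800.0, 5.0, 1200.0, 2500.0, 4.0]
alerts = [watch.observe(r) for r in readings]
print(alerts)  # [False, False, False, False, False, True, True]
```

The alert fires on the sixth reading, once three of the last five samples are above the threshold, and stays active until the slow samples age out of the window.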

We apologize for the inconvenience caused and thank you for your understanding.

Timeline Updates:

All services are operational and stable. No data loss occurred thanks to the redundant storage configuration.
Dec 15, 2025 at 11:47 (UTC+01:00)
Faulty disk replacement is currently in progress on the san12 storage system. The affected drive has been taken out of service and is being replaced with a spare disk. Thanks to storage redundancy, services remain operational and no data loss is expected. Monitoring continues during the resilvering process.
Dec 15, 2025 at 12:36 (UTC+01:00)
The storage pool is now stable and fully operational. The faulty disk has been replaced with a spare drive, and the data reconstruction process (resilvering) has completed successfully with no data loss.

A maintenance operation is planned to perform the physical replacement of the failed disk. Services remain available and under monitoring until this intervention.
Dec 15, 2025 at 13:00 (UTC+01:00)