Incident – Intermittent NFS Storage Instability

Scheduled on: 15.12.2025 09:30:00
Status: Finished
Estimated finish: 15.12.2025 10:47:00

Description
An intermittent instability was detected on the san12 storage infrastructure, resulting in severe latency and temporary freezes when accessing data over NFS.
The issue occurred randomly, with storage access sometimes behaving normally and at other times experiencing delays lasting several minutes.

Root cause
The incident was caused by abnormal behavior on a single disk within a ZFS mirrored vdev.
This disk exhibited extreme I/O latency, temporarily blocking storage operations at the server level and preventing timely NFS responses to client requests.

Although the ZFS pool remained online and data integrity was preserved, these I/O stalls caused visible service disruptions for systems relying on NFS storage.
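
To illustrate how a single slow mirror member can be singled out, here is a minimal sketch that compares each disk's average I/O latency with the median of its peers. It is an illustration only, not the procedure used during the incident: the function name find_slow_disks, the disk names, the thresholds and the sample figures are all hypothetical.

```python
from statistics import median


def find_slow_disks(latency_ms, factor=10.0, floor_ms=50.0):
    """Return the disks whose latency is far above that of their peers.

    latency_ms maps a disk name to its average I/O latency in milliseconds.
    factor and floor_ms are illustrative thresholds, not measured values.
    """
    slow = []
    for disk, ms in latency_ms.items():
        # Compare against the other disks only, so that a two-disk mirror
        # still has a meaningful baseline.
        peers = [value for name, value in latency_ms.items() if name != disk]
        if not peers:
            continue  # a lone disk has nothing to be compared with
        baseline = max(median(peers), 1.0)  # avoid a near-zero baseline
        if ms >= floor_ms and ms > factor * baseline:
            slow.append(disk)
    return sorted(slow)


# Hypothetical sample: one mirror member answering in seconds, the other in ms.
print(find_slow_disks({"disk-a": 4.0, "disk-b": 4200.0}))
```

Comparing against peers rather than a fixed limit keeps the check usable when the whole pool is busy and every disk is slower than usual.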

Corrective actions
Identification of the faulty disk responsible for the I/O stalls

Preventive offlining of the affected disk (service continuity ensured by ZFS redundancy)

Immediate stabilization of NFS access

Enhanced monitoring of the storage subsystem

Planned replacement of the faulty disk
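
The actions listed above correspond to a short sequence of pool operations. The sketch below only builds that sequence, assuming an OpenZFS system where the zpool offline, zpool replace and zpool status subcommands are available. The report does not state which commands were actually run, and the pool and device names (tank, disk-b, spare-a) are hypothetical placeholders.

```python
def _check_name(name):
    """Reject empty names and names that could be parsed as an option."""
    if not name or name.startswith("-"):
        raise ValueError(f"invalid pool or device name: {name!r}")
    return name


def build_replacement_plan(pool, faulty_disk, spare_disk):
    """Return the ordered commands for offlining and replacing a disk.

    Nothing is executed here; each entry is an argument list that an
    operator could review and then run, for example with subprocess.run.
    """
    pool = _check_name(pool)
    faulty_disk = _check_name(faulty_disk)
    spare_disk = _check_name(spare_disk)
    return [
        # 1. Stop sending I/O to the slow disk; the mirror keeps serving data.
        ["zpool", "offline", pool, faulty_disk],
        # 2. Rebuild redundancy onto the spare (this starts the resilver).
        ["zpool", "replace", pool, faulty_disk, spare_disk],
        # 3. Check pool state and resilver progress.
        ["zpool", "status", pool],
    ]


# Hypothetical names, for illustration only.
for command in build_replacement_plan("tank", "disk-b", "spare-a"):
    print(" ".join(command))
```

Keeping the plan as data rather than running it directly leaves room for a review step before anything touches the pool.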

Current status
All services are operational and stable.
No data loss occurred thanks to the redundant storage configuration.

Next steps
Permanent replacement of the affected disk
Scheduled reboot of the storage server to apply NFS performance optimizations
Additional monitoring to detect early signs of storage degradation
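
As an illustration of what early detection could look like, the following sketch raises an alert only when a disk is slow in several of its most recent latency samples, so that one isolated spike does not trigger it. The class name SustainedLatencyAlert, the threshold, the window size and the sample values are assumptions chosen for the example, not the monitoring configuration actually deployed.

```python
from collections import deque


class SustainedLatencyAlert:
    """Alert when enough of the most recent samples exceed a threshold."""

    def __init__(self, threshold_ms=200.0, window=5, min_slow=3):
        self.threshold_ms = threshold_ms
        self.min_slow = min_slow
        # Only the last `window` observations are kept.
        self.samples = deque(maxlen=window)

    def observe(self, latency_ms):
        """Record one latency sample and return True if the alert is active."""
        self.samples.append(latency_ms > self.threshold_ms)
        return sum(self.samples) >= self.min_slow


# Hypothetical per-interval latencies (ms) for a single disk.
alert = SustainedLatencyAlert()
for value in [5, 900, 6, 1200, 1500, 4, 5, 6]:
    print(value, alert.observe(value))
```

The alert clears on its own once the slow samples age out of the window, which suits stalls that come and go as they did in this incident.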

We apologize for the inconvenience caused and thank you for your understanding.

Related servers / services

NFS storage access
File access and backup operations for some hosted services

Date                 Action
15.12.2025 11:47:00  All services are operational and stable. No data loss occurred thanks to the redundant storage configuration.
15.12.2025 12:36:00  Faulty disk replacement is currently in progress on the san12 storage system. The affected drive has been taken out of service and is being replaced with a spare disk. Thanks to storage redundancy, services remain operational and no data loss is expected. Monitoring continues during the resilvering process.
15.12.2025 13:00:00  The storage pool is now stable and fully operational. The faulty disk has been replaced with a spare drive, and the data reconstruction process (resilvering) has completed successfully with no data loss.

A maintenance operation is planned to perform the physical replacement of the failed disk. Services remain available and under monitoring until this intervention.