Warning to MergerFS+SnapRAID Users
Are you using MergerFS+SnapRAID to store and protect your important data collection? If so, you could be exposing yourself to data loss. It is important to qualify that statement: the loss I experienced was not due to a bug in either MergerFS or SnapRAID. It was due to a known design limitation, an engineering trade-off made deliberately by their creators.
I was running an Ubuntu 16.04 system with Samba, Plex, MergerFS, and SnapRAID. My disk layout was as follows:
# DATA
1TB WD Black (12 months old)
3TB WD Green (5 years old, heavy usage)
4TB WD Red (2 months old)

# PARITY
6TB Seagate Barracuda (3 years old, very light usage)
4TB WD Red (2 months old)
4TB WD Red (2 months old)
Nightly SnapRAID syncs and scrubs (5% per run) were performed. Additionally, the day before the incident, a long SMART self-test had completed without issues on the 3TB WD Green. SMART IDs 5, 197, and 198 had always been zero. Basically, aside from its age, the WD Green was indicating perfect health.
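For reference, here is a minimal sketch of the health checks and maintenance described above, using smartctl and the snapraid CLI. The device path /dev/sdb is a placeholder, and the scrub percentage is passed via SnapRAID's -p option:

# Run a long (extended) SMART self-test; /dev/sdb is a placeholder device.
smartctl -t long /dev/sdb

# Inspect the attributes mentioned above: 5 (Reallocated_Sector_Ct),
# 197 (Current_Pending_Sector), 198 (Offline_Uncorrectable).
smartctl -A /dev/sdb | grep -E 'Reallocated_Sector|Current_Pending_Sector|Offline_Uncorrectable'

# Nightly maintenance, e.g. from cron: sync parity, then scrub 5% of the array.
snapraid sync
snapraid scrub -p 5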
I moved a file from my desktop to the file server like I had done a thousand times before. However, on the very next sync, SnapRAID threw an error reporting that it could not read the file due to an I/O error on the drive, and the file was no longer accessible at all. In other words, the file was moved to the file server, and the disk reported the write to the platter as successful when it really was not.
Since I had “moved” rather than “copied” the file, it was no longer on my desktop either. Effectively, it was now lost.
My data fell through a type of write hole that, in hindsight, should not have been unexpected: SnapRAID computes parity only when a sync runs, so a file that is silently corrupted between landing on disk and the next successful sync has no protection at all. Of course, this is not the fault of either SnapRAID or MergerFS, and I will now present two solutions that would have eliminated this issue.
Ultimately, I chose the latter solution.
Solution #1
- For each file, or set of files, to be moved to the file server, create a PAR2 recovery set on the desktop PC before moving them. My favorite application for this is MultiPar. I choose a recovery amount of 10% or 1 file, whichever is smaller. (A shell sketch of the full workflow follows this list.)
- Verify the PAR2 recovery set before copying the files to the file server.
- Copy (not move) the files to the file server.
- Re-verify the integrity of the files that are now on the server using the PAR2 recovery set.
- Run a SnapRAID sync to create the parity and protection.
- Delete the local copy of the file.
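As a concrete illustration of the steps above, here is a minimal shell sketch using par2cmdline (the command-line counterpart to MultiPar, which is a Windows GUI). The file names, paths, and the host name server are all hypothetical:

# 1. Create a PAR2 recovery set with 10% redundancy on the desktop.
par2 create -r10 movie.par2 movie.mkv

# 2. Verify the recovery set before anything leaves the desktop.
par2 verify movie.par2

# 3. Copy (not move) the file and its recovery set to the file server.
rsync -av movie.mkv movie*.par2 server:/mnt/storage/movies/

# 4. Re-verify the copy on the server against the PAR2 set.
ssh server 'cd /mnt/storage/movies && par2 verify movie.par2'

# 5. Run a SnapRAID sync so the new file is covered by parity.
ssh server snapraid sync

# 6. Only now is it safe to delete the local copy.
rm movie.mkv movie*.par2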
Solution #2
Move to ZFS on Linux and stop worrying about all this. Done.
I decided to go with Solution #2. Data is checksummed before it even hits the disk, and every read is verified against that checksum. I use mirrored vdevs for maximum flexibility, and at $160 per 8TB drive, it is difficult to complain about the storage efficiency. So far, I have been very happy. The striped mirror vdevs provide boatloads of I/O, and the odds that the exact same bits are corrupted on both drives of a mirror are infinitesimal. I monitor htop constantly, and even though ZFS uses a good portion of available RAM, it leaves plenty for the other processes running on the machine. Of course, I still maintain backups, because RAID is not a backup.
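For anyone curious what that layout looks like in practice, here is a minimal sketch of a two-vdev striped mirror pool. The pool name tank and the disk IDs are placeholders:

# Create a pool striped across two mirrored vdevs (RAID10-style).
zpool create tank \
    mirror /dev/disk/by-id/ata-DISK_A /dev/disk/by-id/ata-DISK_B \
    mirror /dev/disk/by-id/ata-DISK_C /dev/disk/by-id/ata-DISK_D

# Grow the pool later by adding another mirror pair.
zpool add tank mirror /dev/disk/by-id/ata-DISK_E /dev/disk/by-id/ata-DISK_F

# A periodic scrub reads every block, verifies its checksum, and repairs
# silent corruption from the healthy side of the mirror.
zpool scrub tank
zpool status -v tank

This is exactly the failure mode that bit me above: with mirrored vdevs, a block that comes back from one disk with a bad checksum is automatically rewritten from the other side, instead of being silently lost.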