Setting up a ZFS RAID NAS for my pictures
Over the years I’ve shot many pictures and videos, and the required disk space has accumulated rapidly, especially once I started shooting in RAW format. Until two weeks ago, my storage approach consisted of two pairs of 2 TB 2.5” HDDs in USB enclosures, formatted to NTFS and mirrored at the file level. (Back when I created this setup, I was still dual booting.) Apart from running out of space, it had become a total disgrace, and it also meant the pictures weren’t accessible to family members. Things had to change.
I settled on a NAS. I briefly considered cloud storage, but requiring both high availability and half a dozen terabytes would amount to a significant recurring cost. Besides, I want to control my data myself.
This article covers the first part of the transition - setting up the NAS with ZFS RAIDZ2.
Hard- and software selection
Software
My particular requirements were as follows:
- Tolerates two disk failures. There’s a good chance that a second disk breaks while a RAID array is reconstructing the first failed one, and once a RAID array fails, it’s gone. Of course, I’ll also be storing this data somewhere else, but I’d rather not have to execute that time-consuming recovery.
- Snapshots without duplication. Almost all backup software can roll back to a previous point in time, yet almost without exception it requires dedicating half your disk space to that backup. Technically speaking, that’s unnecessary: one only has to store the changes since the snapshot, not a full copy.
Let’s start with the second requirement, because it already narrows down the choice considerably. Two common filesystems support this style of snapshotting: Btrfs and ZFS. No additional software is required - both are copy-on-write filesystems: when a file is modified, the old data is kept, so a rollback boils down to deleting the new version and marking the old one as current again. This still requires some space, but only for files that changed since the latest snapshot, and as most of my pictures don’t change anyway, the required space is negligible.
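As a toy illustration (this is a simplified model, not how ZFS is implemented internally), a copy-on-write snapshot can be sketched as a table of block pointers: a snapshot copies only the pointer table, and a write allocates a new block instead of overwriting, so extra space is consumed only for blocks that actually changed.

```python
# Toy copy-on-write model: the "filesystem" maps file names to block ids,
# and blocks are immutable once written.
blocks = {}                      # block id -> data
files = {"photo.raw": 0}
blocks[0] = b"original image data"

# Taking a snapshot copies only the pointer table, not the data.
snapshot = dict(files)

# Modifying a file writes a NEW block and repoints the file at it;
# the old block stays on disk because the snapshot still references it.
blocks[1] = b"edited image data"
files["photo.raw"] = 1

# Rollback is just restoring the pointer table.
files = dict(snapshot)
assert blocks[files["photo.raw"]] == b"original image data"
# Extra space used by the snapshot: one changed block, not a full copy.
```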
From the first requirement, it follows that I’d want at least six disks, to bring the used disk ratio to a reasonable 2/3rds (two of the six disks are for redundancy). RAID comes in many flavors:
- Hardware RAID: I’m not considering that, too black box for my liking.
- Btrfs RAID6: sounds good, however this particular configuration suffers from write-integrity bugs. Not exactly what I’m aiming for here.
- Linux RAID6 + ZFS: very mature. That said, Linux kernel mdadm can also be picky at times and show the weirdest symptoms.
- ZFS RAIDZ2: for my application, this one’s actually more suitable than Linux RAID:
  - Better integrity than Linux RAID6, because it relies on checksums in addition to parity and similar schemes. A key difference here is that RAIDZ2 handles the case where one disk fails and another is (partially) silently corrupted: RAIDZ2 figures out which one is corrupted using the checksums (and can then correct that corruption), while RAID6 can only determine that there is a disagreement. (Linux RAID relies on the disk itself to report corruption.)
  - One-stop shop: only one thing to mess around with.
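To make the checksum argument concrete, here is a toy single-parity stripe (hypothetical and far simpler than real RAIDZ): parity alone can only detect that the stripe disagrees, but a per-block checksum identifies which block is corrupt, after which the parity can rebuild it.

```python
import hashlib

def xor(a: bytes, b: bytes) -> bytes:
    """XOR two equal-length byte strings (toy parity)."""
    return bytes(x ^ y for x, y in zip(a, b))

# Two data blocks plus XOR parity - a toy single-parity stripe.
d0, d1 = b"AAAA", b"BBBB"
parity = xor(d0, d1)
checksums = [hashlib.sha256(d).digest() for d in (d0, d1)]

# One block gets silently corrupted on disk.
d1 = b"BBBX"

# Parity alone only tells us that SOMETHING in the stripe is wrong:
assert xor(d0, d1) != parity

# Checksums identify WHICH block is bad...
bad = [i for i, d in enumerate((d0, d1))
       if hashlib.sha256(d).digest() != checksums[i]]
assert bad == [1]

# ...so the bad block can be rebuilt from the good block and the parity.
d1 = xor(d0, parity)
assert hashlib.sha256(d1).digest() == checksums[1]
```

Real RAIDZ2 uses two parity blocks and Merkle-tree checksums, but the disambiguation principle is the same.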
So, ZFS RAIDZ2 it is. In addition to the above, it has several nice features like built-in caching and very configurable compression. And on top of that, it’s blazing fast! I’ll discuss that later; let’s first look at the hardware.
Hardware
Given that setting this up takes considerable time, I also want it to last, and therefore I went with 4 TB disks, which makes for 16 TB usable. A bit more than I need, but 3 TB disks don’t make sense because they’re almost the same price. Evidently, the easy option would have been to fork out some money for a dedicated NAS. I chose not to, for the following three reasons:
- $$$: cost would be at least €1929 (€873 for the cheapest six-bay NAS + 6 × €176 per disk).
- Environmental: electronics production pollutes heavily, and a NAS sips energy all the time. A machine that does dual duty as the family PC is therefore preferred.
- ¿What’s the fun in that?
Concerning the OS, I’m running Linux Mint. Dedicated NAS OSes like TrueNAS would be more suitable, but they would make it complicated to let this machine do dual duty as the family PC.
Setting up ZFS
OpenZFS drivers
ZFS originated at Sun Microsystems in 2001, where they needed a robust file system for their Solaris OS. Currently, OpenZFS leads the development, and it’s fully open source. Installation takes a few minutes (it involves installing a kernel module):
sudo apt install zfsutils-linux
Pool
I started by setting up a ZFS storage pool, which unites the six disks into one continuous storage space. Note that these settings cannot be adjusted later (luckily, ZFS is much more flexible after pool creation). A few notes:
- `ashift=12` to force ZFS to use a 4K alignment, which matches the sector size of the disks I’m using (WD Red 4 TB). Choosing a wrong value here wrecks write performance.
- `vault` is the name of my storage pool. This will also show up in the file path.
- `raidz2` to have two redundant disks. ZFS provides many options for this value, allowing the system to be optimized towards speed, capacity, or reliability. RAIDZ2 provides an excellent middle ground here.
- Disks are referred to by their `/dev/disk/by-id` paths, which are based on the disks’ hardware identifiers and are independent of the host system. Compared to using paths like `/dev/sda`, this avoids running into issues when I have to reconstruct the array if the host machine ever fails.
sudo zpool create -o ashift=12 vault raidz2 \
/dev/disk/by-id/ata-WDC_WD40EFAX-68JH4N0_WD-WX32D7088AF1 \
/dev/disk/by-id/ata-WDC_WD40EFRX-68N32N0_WD-WCC7K2RSYV6X \
/dev/disk/by-id/ata-WDC_WD40EFZX-68AWUN0_WD-WX12DA0EAU0T \
/dev/disk/by-id/ata-WDC_WD40EFZX-68AWUN0_WD-WX22DC0A3T34 \
/dev/disk/by-id/ata-WDC_WD40EFZX-68AWUN0_WD-WX42D3173AVA \
/dev/disk/by-id/ata-WDC_WD40EFZX-68AWUN0_WD-WXC2D903LAZ1
Datasets
Creation of the pool took care of the disks and RAID; next up is setting up the filesystem. For this, ZFS provides the concept of datasets. Configuration and snapshots are executed on the dataset level, and as such this concept provides a means to optimize parts of the pool for different uses. Datasets are hierarchical, and by default, the pool’s root directory is a dataset, so let’s start with configuring that one:
# Disable access-time updates - no need for that
sudo zfs set atime=off vault
# Enable compression (discussed in depth in the next section)
sudo zfs set compression=zstd-1 vault
For now, I decided to set up three datasets: pictures, storage and backup. pictures is where I’ll store all my images and videos, storage is for files that I want to keep but not carry on my laptop, and backup is only for storing duplicates. The latter two I created with default settings; for the pictures dataset, however, I used a larger recordsize of 1 MB instead of the default 128 KB.
sudo zfs create -o recordsize=1M vault/pictures
sudo zfs create vault/storage
sudo zfs create vault/backup
There’s a lot to say about recordsize. Note that it only affects files larger than this size. Large files get split into an integer number of recordsize chunks, and any interaction with that file always involves reading or writing a full chunk. For the pictures dataset, I’m always reading or writing a full file, with the exception of playing videos, and there the 1 MB chunk size is sufficiently small (1 MB = 0.2 s of video at 50 Mbps). In this usage, large chunks make for faster operation because of lower bookkeeping overhead, less fragmentation and more efficient compression (because compression occurs per-chunk). However, this large chunk size makes this dataset unsuited for files that are updated by small modifications only, such as databases. So, if I store my digiKam database on the NAS, then I must not put it in the pictures dataset, and instead create a new dataset with a small recordsize.
Fortunately, most dataset properties can be changed after creation, though they don’t apply retroactively to existing files.
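As a quick sanity check on the video-playback claim above (assuming the 50 Mbps bitrate from the text), one 1 MiB record holds roughly 0.17 s of video - the ~0.2 s granularity quoted above:

```python
# How much video fits in one 1 MiB record at a 50 Mbps bitrate?
record_bits = 1024 * 1024 * 8            # one 1 MiB record, in bits
bitrate = 50 * 10**6                     # 50 Mbps
seconds = record_bits / bitrate          # ≈ 0.17 s of video per record
print(f"{seconds:.2f} s of video per record")
```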
Compression
Many guides recommend using zstd-3 compression. It’s a great algorithm, but unfortunately it’s just not effective on my pictures and videos:
arend@server-002:/vault$ zfs get all vault/pictures | grep compress
vault/pictures compressratio 1.03x -
vault/pictures compression zstd-3 local
vault/pictures refcompressratio 1.03x -
(Of note is that I use Sony and Nikon cameras, which both already compress their RAW files. I assume uncompressed RAW files will show higher ratios.) To add insult to injury, with my venerable i7-4770K, it reduced the write speed to a depressingly low 73 MB/s, which is below the 120 MB/s of my gigabit home network:
arend@server-002:~$ fio --name=zfs_seqwrite --rw=write --bs=4M --size=32G \
--numjobs=1 --ioengine=libaio --iodepth=16 --direct=1 --end_fsync=1 \
--filename=/vault/pictures/fio_testfile --output-format=normal
{...}
WRITE: bw=69.1MiB/s (72.5MB/s), {...}
That clearly won’t do. First, let’s see what these disks are really capable of. Theoretically, because ZFS distributes each file over four disks (actually six, but two of those are parity and get redundant data), the write speed should be four times that of an individual disk. According to a single disk benchmark, this averages to about 130 MB/s, which makes for an expected pool speed of 520 MB/s. After disabling compression, I get a continuous write speed of 509 MB/s:
arend@server-002:/vault$ sudo zfs set compression=off vault/pictures
arend@server-002:/vault$ fio --name=zfs_seqwrite --rw=write --bs=4M --size=32G \
--numjobs=1 --ioengine=libaio --iodepth=16 --direct=1 --end_fsync=1 \
--filename=/vault/pictures/fio_testfile --output-format=normal
{...}
WRITE: bw=485MiB/s (509MB/s), {...}
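The back-of-envelope estimate above, spelled out (the 130 MB/s single-disk figure is my own benchmark average):

```python
# Expected sequential write speed of a 6-disk RAIDZ2:
# only the data disks contribute; the two parity disks hold redundancy.
disks = 6
parity_disks = 2
single_disk_mb_s = 130                       # measured single-disk average
expected = (disks - parity_disks) * single_disk_mb_s
print(f"expected: {expected} MB/s")          # vs. 509 MB/s measured
```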
Not bad for secondhand spinning rust! In practice, disabling compression is not an option, because it wastes up to half of the disk space. Let’s illustrate that by running this small script:
#!/usr/bin/python3
import os

size = int(1.1 * 1024 * 1024)  # 1.1 MiB
for i in range(1, 101):
    with open(f"testfile_{i:02d}.bin", "wb") as f:
        f.write(os.urandom(size))
print(f"Created 100 files of {size} bytes ({size / 1024 / 1024:.1f} MiB) each")
in a newly created dataset with compression disabled:
arend@server-002:/vault$ sudo zfs create -o recordsize=1M vault/test
arend@server-002:/vault$ sudo zfs set compression=off vault/test
arend@server-002:/vault$ cd test
arend@server-002:/vault/test$ ./create_files.py
Created 100 files of 1153433 bytes (1.1 MiB) each
110 MiB of data was stored. However, the disk space actually used is almost double that:
arend@server-002:/vault/test$ sudo du -sh /vault/test
202M /vault/test
The reason for this discrepancy is that each of the 1.1 MiB files gets stored in two recordsize chunks, so 2 MiB per file. Of each second chunk, only a fraction is used, but ZFS allocates a full block regardless. This makes for a space efficiency of only 55 %!
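The allocation arithmetic behind these numbers can be sketched as follows (assuming, as above, files of at least one full record):

```python
import math

def allocated(file_size: int, recordsize: int = 1024**2) -> int:
    """Disk space a file of at least one record occupies without
    compression: an integer number of full records."""
    return math.ceil(file_size / recordsize) * recordsize

file_size = int(1.1 * 1024 * 1024)   # 1153433 bytes, as in the script above
alloc = allocated(file_size)         # 2 MiB: two full 1 MiB records
efficiency = file_size / alloc
print(f"{alloc} bytes allocated, {efficiency:.0%} efficient")
# → 2097152 bytes allocated, 55% efficient
```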
This is not purely theoretical: many of my phone pictures are in this range. The solution is simple - turn on some form of compression. The weapon of choice here is lz4, which, while not as efficient as zstd, is a whole lot faster. On my system, lz4 compression gave a write speed of 495 MB/s, practically the same as compression disabled. Therefore I configured the pictures dataset to use lz4 compression and the others (where there’s actually something to compress) to use zstd-1, which shows an acceptable speed of 204 MB/s. To apply lz4 compression:
sudo zfs set compression=lz4 vault/pictures
In a summary table:
| Compression | Continuous write speed [MB/s] |
|---|---|
| off | 509 |
| lz4 | 495 |
| zstd-1 | 204 |
| zstd-3 | 73 |
ARC Cache
ZFS relies heavily on caching reads to improve performance. The machine has 32 GB RAM, far too much for its current tasks, so let’s repurpose up to 16 GB of that for ZFS caching by adding the following line to /etc/modprobe.d/zfs.conf:
options zfs zfs_arc_max=17179869184
Of note is that too large values here can negatively affect performance because “free” RAM is also used by the Linux kernel to cache the root filesystem, and the kernel gives priority to ZFS cache. Also, keep in mind that ZFS’s write cache is separate. (ZFS batches writes, flushing to disk only every ~5 s, which it uses to reorder IO operations to a sequence that reduces seek time.)
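The value in the config line is simply 16 GiB expressed in bytes, since zfs_arc_max takes a byte count:

```python
# zfs_arc_max is specified in bytes: cap the ARC at 16 GiB.
arc_max = 16 * 1024**3
print(f"options zfs zfs_arc_max={arc_max}")
# → options zfs zfs_arc_max=17179869184
```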
Next
Next up is making the pictures accessible over the network, as well as running some sort of picture server along the lines of Immich. I’m also still figuring out the best way to make digiKam, the photo-organizing application I use, work smoothly with this new setup. Both topics will be covered in a future article!
Be sure to share your questions, about this or upcoming articles, in the comments below.