I have a problem when it comes to data - I have too much of it. I’ve been burned multiple times by losing access to content that was online. I’ve also lost data to cloud providers changing policies, accidentally locking my account, or suffering their own data loss.
So I took it upon myself to own and back up all of my own data. So far that’s actually gone really well, and as we’ve seen more and more privacy issues stemming from the cloud, that decision has only been reinforced.
Like many people, I started with an off-the-shelf commercial NAS (Network Attached Storage) device - a little four-bay QNAP. It was… okay. It lost Time Machine data a few times, was underpowered for tasks like running a small media server, and offered little room to expand. At this point I had around 800 GB of data.
So I custom built a machine using mid-grade hardware. I popped in some hard drives, ran FreeNAS (a FreeBSD-based distro), and started storing more data. This first server survived a move and got me up to about 4 TB of storage.
Most of my issues with this first server were really issues with FreeNAS. I wanted to run more software to do useful things with my data, and repeatedly hit problems running software either directly or via “jails” (lightweight isolated environments). By this point it was pretty obvious everyone liked Docker, and ZFS had gone from a niche FreeBSD feature to having serious Linux support.
So I rebuilt the server. More drives, more hardware, and I moved to Ubuntu, using ZFS for my data storage drives and Docker Compose for all of my software. I’m still using that same server today, and while I’ve made several tweaks to my setup, it’s basically the same.
Wait, so what’s so special about ZFS?
ZFS is a filesystem designed to span multiple hard drives, whether by striping (splitting data across drives for speed) or mirroring (duplicating data across drives for redundancy). It’s fast, handles failing drives well, supports cache drives (e.g. an SSD to speed up a set of spinning disks), and has built-in snapshotting and restores - making backups amazingly easy.
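As a rough sketch, building a mirrored pool with an SSD read cache looks something like this. The pool name `tank` and the device paths are placeholders, not my actual layout:

```shell
# Create a pool named "tank" from two mirrored drives
# (device names are placeholders -- check yours with `lsblk`)
zpool create tank mirror /dev/sda /dev/sdb

# Attach an SSD as a read cache (L2ARC) for the pool
zpool add tank cache /dev/nvme0n1

# Check pool health and layout
zpool status tank
```

Using `mirror` gives redundancy; listing the drives without `mirror` would stripe them for capacity and speed instead.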
Hard drives fail… quite often, actually. A well-used drive can die within 5 years, and many will perish within 10. If you buy your drives in pairs, you may lose both within a short window. Assuming we care about the data we’re storing, we want to preserve it even when things go a bit wrong. ZFS is excellent for this, both directly - letting us duplicate data across drives and preventing bit rot - and through its snapshots, which let us encrypt and do partial backups with ease. When things do go wrong, we can restore easily.
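The snapshot workflow is only a few commands. Dataset and host names below are illustrative:

```shell
# Take a dated, read-only snapshot of a dataset (essentially free)
zfs snapshot "tank/home@$(date +%Y-%m-%d)"

# Roll the dataset back to a snapshot if something goes wrong
zfs rollback tank/home@2021-06-01

# Replicate a snapshot to another pool or machine for backup
zfs send tank/home@2021-06-01 | ssh backuphost zfs receive backup/home
```

Because `zfs send` streams a snapshot as plain data, it can be piped through ssh, compression, or encryption on the way to wherever the backup lives.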
Compared to traditional hardware RAID, ZFS gives us more flexibility and doesn’t require RAID controllers or hardware tweaks. We avoid trying to load RAID drivers at boot and the other asinine nonsense we might traditionally associate with RAID setups. We can also expand pools and add cache drives or hot spares with ease.
ZFS does have downsides compared to hardware RAID: OS support is limited to Linux and BSD, and it wants more memory - ZFS typically works best with at least 8 GB of RAM, preferably more. It also really helps if your pool is made up of identical drives, meaning it’s best to buy multiples of the same drive when building a new pool.
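Expanding a pool or adding a spare really is a one-liner each. Again, device names here are hypothetical:

```shell
# Grow the pool by adding a second mirrored pair of drives
zpool add tank mirror /dev/sdc /dev/sdd

# Add a hot spare that ZFS can automatically pull in when a drive fails
zpool add tank spare /dev/sde
```

Compare that with a hardware RAID card, where growing an array often means a trip into a BIOS-level utility and a long rebuild.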
So what’s in the server?
- Ryzen 3600 Processor
- 32 GB DDR4-3200
- 256 GB Samsung 970 Evo M.2
- 2x 6 TB WD Red
- 2x 4 TB WD Red
- 240 GB SSD
- Gold Rated 750W PSU
Aside from a decent spot of RAM, these are actually very mild-mannered specs. Nothing here is screaming fast, and we’re using a very middle-of-the-road processor. We want to minimize idle power usage (hence the modern AMD processor and a good power supply), and we want to keep our drive bays free for storage drives.
Our M.2 drive isn’t there for speed - it holds the OS and keeps all of our drive bays free. Losing the OS drive would be inconvenient, but it’s not data loss. I script all of my system bootstrapping, so rebuilding the OS is really just a twenty-minute task.
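My actual bootstrap scripts aren’t shown here, but the idea can be sketched like this - package names, the pool name, and the git URL are all assumptions for illustration:

```shell
#!/bin/sh
# Hypothetical sketch of an OS rebuild script -- package names and
# paths are illustrative, not my exact setup.
set -eu

# Base packages: ZFS tooling and Docker
apt-get update
apt-get install -y zfsutils-linux docker.io docker-compose

# Import the existing data pool -- the data drives survive the OS rebuild
zpool import tank

# Fetch the compose files from git and bring everything back up
git clone https://git.example.com/me/homeserver.git /opt/homeserver
docker-compose -f /opt/homeserver/docker-compose.yml up -d
```

Since all state lives on the ZFS pool and all software is declared in compose files, the OS itself is disposable.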
Almost everything runs in Docker Compose. This lets us specify all of our software as a set of YAML files we can keep in git, and lets us port that software anywhere we want.
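A hypothetical excerpt of one of those compose files, wiring a service’s config and media into the ZFS datasets - the mount paths and options are illustrative, not my exact config:

```yaml
# docker-compose.yml (excerpt, illustrative)
services:
  jellyfin:
    image: jellyfin/jellyfin
    restart: unless-stopped
    volumes:
      - /tank/apps/jellyfin:/config   # app state on the Apps dataset
      - /tank/media:/media:ro         # bulk media, read-only to the app
    ports:
      - "8096:8096"
```

Keeping the container’s state under a dedicated dataset path is what makes the snapshot-based backups below work.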
- Jellyfin - my preferred media server
- Plex - used by some family members outside of my home
- Syncthing - synchronizes data
- Heimdall Dashboard - glues all the apps together in a single pane of glass
- Ubooquity - an ebook and comic server, great for tablet devices
- Gitea - personal github replacement
- ArchiveBox - tool for archiving websites faithfully
- Node-RED - personal automation and scripting tool, used heavily for scraping sites
- Traefik - frontend proxy that glues the different services together under a single URL and handles SSL
- Home Assistant - open source home automation
- Prometheus, Promtail and Grafana - dashboards for monitoring everything
Outside of this server, I also run Pi-hole, which provides DNS resolution and blocks ads across the entire house. I run it on two dedicated Raspberry Pis to help with uptime.
Application data lives in a dedicated ZFS dataset, so it can be snapshotted and backed up more frequently than bulk data like media.
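Setting that up is straightforward - dataset names here match the layout described below, but the schedule and options are illustrative:

```shell
# A dedicated dataset for application state, with cheap compression
zfs create tank/apps
zfs set compression=lz4 tank/apps

# Frequent snapshots are nearly free; something like this could run
# hourly from cron, while bulk media is snapshotted far less often
zfs snapshot "tank/apps@$(date +%Y-%m-%d-%H%M)"
```

Because snapshots are per-dataset, app state can get an aggressive schedule without churning through snapshots of multi-terabyte media.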
rclone is used to back up the cloud hosts I use to the server nightly. Syncoid is used to manage snapshot replication.
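The nightly jobs amount to a couple of commands run from cron - remote names, paths, and the backup host are placeholders:

```shell
# Pull backups from a rented cloud host into the Home dataset
# ("gamingbox" is a placeholder for an rclone remote configured elsewhere)
rclone sync gamingbox:/srv/backups /tank/home/backups/gamingbox

# Replicate snapshots of the app dataset to another host's pool
# (syncoid handles incremental zfs send/receive between the two)
syncoid tank/apps backuphost:backup/apps
```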
I keep four top level datasets:
- Apps - Application data
- Media - Movies, TV, music, comics, books, etc.
- Home - Personal directories, and where backups of other systems live
- Hoard - Large archives, usually of data that has disappeared or is at risk of disappearing from the internet
Hoard is broken down by subject matter, while Media is broken down by type. Home is broken down by user.
Media is usually the largest, currently sitting at almost 4 TB alone. Hoard occupies 2 TB, Home is about 1.5 TB, and Apps is under 10 GB.
My current backup strategy is a rotating assortment of external drives that are kept in a fireproof safe. This isn’t ideal since it requires manual steps to perform a backup, the external drives are easily damaged, and I have to carefully catalogue and track drives. The upside is that it’s easy to bring a copy of my data in an emergency or trade drives with a friend.
I’m currently building a dedicated backup machine that will regularly receive ZFS snapshots. It’s an ongoing process and it’s using an older rackmount server, so it may end up being too power hungry or loud.
We’re a device-heavy household. We have several tablets, personal laptops, gaming desktops, phones, etc. All of those devices are set to back up to the server regularly, into each person’s home directory.
I also rent a few servers - one for gaming with friends, another for side projects. The home server reaches out to collect stats and logs from them, and they’re covered by the dashboard. Those servers also run all of their software via Docker Compose.
Is it worth it?
Pre-pandemic, I was a bit more on the fence about it. It was a hobby and entertaining, but the utility wasn’t entirely obvious. Once the pandemic hit and we also had several other minor disasters, the benefit has been clear. We’ve had movies during days long internet outages, I have archives of websites that no longer exist, and I have a single place for automated tasks to run.
Now that lockdown is over, we’re trying to find ways to use the server outside the home. Currently I have OpenVPN set up on the router; WireGuard support is coming quickly, and I’ll likely migrate as soon as I’m able. On vacation we find we’re on the VPN most of the time, if only for the ad blocking.
That being said, it’s a fair bit of work. If it’s a hobby and you enjoy it, then it’s likely worth it. If you would rather keep things simple, a commercial device may be your best bet - though keep in mind how insecure commercial devices tend to be.