Pitfalls of direct IO with block devices?

I'm building a database on top of io_uring and the NVMe API. I need a place to store seldom-used, large, append-like records (older parts of message queues, columnar tables that have already been aggregated, old WAL blocks kept around for potential restores, …), and I was thinking of adding HDDs to the storage pool to save money.

The server I'm experimenting on is bare metal, running a very recent Linux kernel (needed for io_uring), with 128 GB RAM, 24 threads, 2× 2 TB NVMe, and 14× 22 TB SATA HDD.

At the moment my approach is:
– No filesystem: Direct IO straight to the block device (see the sketch after this list)
– Store metadata in RAM for fast lookup
– Use the NVMe drives to persist metadata and act as a writeback cache
– Use a 16 MB block size
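To make that concrete, the write path would look roughly like this (minimal sketch, not my actual code: the device path /dev/sdX, the queue depth, and the stripped-down error handling are all placeholders):

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <liburing.h>

#define BLOCK_SIZE (16UL << 20)   /* 16 MB application block */

int main(void)
{
    /* O_DIRECT bypasses the page cache; buffer, offset and length must
     * then be aligned to the device's logical block size. */
    int fd = open("/dev/sdX", O_RDWR | O_DIRECT);
    if (fd < 0) return 1;

    void *buf;
    if (posix_memalign(&buf, 4096, BLOCK_SIZE)) return 1;  /* 4K covers 512e and 4Kn drives */
    memset(buf, 0xAB, BLOCK_SIZE);

    struct io_uring ring;
    if (io_uring_queue_init(64, &ring, 0)) return 1;

    struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
    io_uring_prep_write(sqe, fd, buf, BLOCK_SIZE, 0);      /* one 16 MB block at offset 0 */
    io_uring_submit(&ring);

    struct io_uring_cqe *cqe;
    io_uring_wait_cqe(&ring, &cqe);                        /* cqe->res < 0 means -errno */
    int res = cqe->res;
    io_uring_cqe_seen(&ring, cqe);

    io_uring_queue_exit(&ring);
    free(buf);
    close(fd);
    return res == (int)BLOCK_SIZE ? 0 : 1;
}
```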

It honestly looks really effective:
– The NVMe cache lets me saturate the 50 Gbps downlink without problems, unlike the existing Linux caching solutions (bcache, LVM cache, …)
– By the time data reaches the HDDs it has already been compacted, so the drives only see large linear writes and reads
– I get the REAL read benefit of RAID1, since I can stripe read access across drives (or nodes); see the sketch after this list
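What I mean by striping reads is something like this (sketch only; fds[], n_mirrors and the chunk size are illustrative assumptions, with each drive holding a full replica):

```c
#include <stdint.h>
#include <liburing.h>

#define READ_CHUNK (16UL << 20)   /* match the 16 MB block size */

/* Alternate mirrors by block index so a large sequential scan fans out
 * across all replicas instead of hammering a single spindle. */
static void queue_striped_read(struct io_uring *ring, const int *fds,
                               int n_mirrors, void *buf, uint64_t offset)
{
    int mirror = (int)((offset / READ_CHUNK) % (uint64_t)n_mirrors);
    struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
    io_uring_prep_read(sqe, fds[mirror], buf, READ_CHUNK, offset);
    io_uring_sqe_set_data(sqe, buf);   /* so the completion can be matched to its buffer */
}
```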

Anyhow, while I know the NVMe spec inside out, I'm unfamiliar with using HDDs as plain block devices without a filesystem. My questions are:
– Are there any pitfalls I'm not considering?
– Is there a reason why I should prefer using an FS for my use case?
– My benchmarks show a lot of unused RAM. Should I do Buffered IO to the disks instead of Direct IO? Then I'd have to deal with the fsync problem and I'd lose asynchronicity on some operations (though see the sketch after this list); on the other hand, reinventing the kernel's caching feels like a pain…
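If I did go buffered, I imagine durability could still stay inside io_uring by linking each write to an fdatasync instead of calling a blocking fsync() (sketch only; fd, buf, len and off are placeholders):

```c
#include <liburing.h>

static void queue_durable_write(struct io_uring *ring, int fd,
                                const void *buf, unsigned len, __u64 off)
{
    struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
    io_uring_prep_write(sqe, fd, buf, len, off);
    io_uring_sqe_set_flags(sqe, IOSQE_IO_LINK);      /* fsync runs only after the write completes */

    sqe = io_uring_get_sqe(ring);
    io_uring_prep_fsync(sqe, fd, IORING_FSYNC_DATASYNC);

    io_uring_submit(ring);                           /* reap both CQEs elsewhere */
}
```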
