Thursday, April 4, 2013

Cables Matter

No, I don't mean brand, or color, or even connector-type zealotry.. but when sizing solutions and working with customers and partners on new builds, I often find that either no thought goes into the SAS cabling, or only a little bit does. I find this distressing. Let me tell you why.

Tuesday, April 2, 2013

ZFS Intent Log

[edited 11/22/2013 to modify formula]

The ZFS Intent Log gets a lot of attention, and unfortunately often the information being posted on various forums and blogs and so on is misinformed or makes assumptions about the knowledge level of the reader that if incorrect can lead to danger. Since my ZIL page on the old site is gone now, let me try to reconstruct the knowledge a bit in this post. I'm hesitant to post this - I've written the below and.. it is long. I tend to get a bit wordy, but it is also a subject with a lot of information to consider. Grab a drink and take your time here, and since this is on Blogger now, comments are open so you can ask questions.

If you don't want to read through this entire post, and you are worried about losing in-flight data due to things like a power loss event on the ZFS box, follow these rules:
  1. Get a dedicated log device - it should be a very low-latency device, such as a STEC ZeusRAM or an SLC SSD, but even a high quality MLC SSD is better than leaving log traffic on the data vdevs (which is where they'll go without log devices in the pool). It should be at least a little larger than this formula, if you want to prevent any possible chance of overrunning the size of your slog: (maximum possible incoming write traffic in GB * seconds between transaction group commits * 3). Make it much larger if its an SSD, and much much larger if its an MLC SSD - the size will help with longevity. Oh, and seconds between transaction group commits is the ZFS tunable zfs_txg_timeout. Default in older distributions is 30 seconds, newer is 5, with even newer probably going to 10. It is worth noting that if your rarely if ever have heavy write workloads, you may not have to size it as large -- it is very preferably from a performance perspective that you not be regularly filling the slog, but if you do it rarely it's no big deal. So if your average writes in [txg_timeout * 3] is only 1 GB, then you probably only need 1 GB of log space, and just understand when you rarely overfill it there will be a performance impact for a short period of time while the heavy write load continues. [edited 11/22/2013 - also, as a note, this logic only applies on ZFS versions that still use the older write code -- newer versions will have the new write mechanics and I will update this again with info on that when I have it]
  2. (optional but strongly preferred) Get a second dedicated log device (of the exact same type as the first), and when creating the log vdev, specify it as a mirror of the two. This will protect you from nasty edge cases.
  3. Disable 'writeback cache' on every LU you create from a zvol, that has data you don't want to lose in-flight transactions for.
  4. Set sync=always on the pool itself, and do not override the setting on any dataset you care about data integrity on (but feel free TO override the setting to sync=disabled on datasets where you know loss of in-transit data will be unimportant, easily recoverable, and/or not worth the cost associated with making it safe; thus freeing up I/O on your log devices to handle actually important incoming data).
Alright, on with the words.