Nex7's Blog: 2013

Wednesday, October 9, 2013

Lots of L2ARC not a good idea, for now

Before I get into it, let me say that the majority of the information below and analysis of it comes from one of my partners in crime (coworker at Nexenta), Kirill (@kdavyd). All credit for figuring out what was causing the long export times, and reproducing the issue internally (it was initially observed at a customer) after we cobbled together enough equipment, and analysis to date, goes to him. I'm just the messenger, here.

Here's the short version, in the form of the output from a test machine (emphasis is mine):

root@gold:/volumes# time zpool export l2test

real 38m57.023s
user 0m0.003s
sys 15m16.519s

That was a test pool with a couple of TB of L2ARC, filled up, with an average block size on pool of 4K (a worst case scenario) -- and then simply exported, as shown. If you are presently using a pool with a significant amount of L2ARC (100's of GB's or TB's), (update:) and you have multiple (3+) cache vdevs, and have any desire to ever export your pool that has a time limit attached to it before you need to be importing it again to maintain service availability, you should read on. Otherwise this doesn't really affect you. In either case, it should be fixable, long-term, so probably not a panic-worthy piece of information for most people. :)

Cables Matter

No, I don't mean brand, or color, or even connector-type zealotry.. but when sizing solutions and working with customers and partners on new builds, I often find that either no thought goes into the SAS cabling, or only a little bit does. I find this distressing. Let me tell you why.

ZFS Intent Log

[edited 11/22/2013 to modify formula]

The ZFS Intent Log gets a lot of attention, and unfortunately often the information being posted on various forums and blogs and so on is misinformed or makes assumptions about the knowledge level of the reader that if incorrect can lead to danger. Since my ZIL page on the old site is gone now, let me try to reconstruct the knowledge a bit in this post. I'm hesitant to post this - I've written the below and.. it is long. I tend to get a bit wordy, but it is also a subject with a lot of information to consider. Grab a drink and take your time here, and since this is on Blogger now, comments are open so you can ask questions.

If you don't want to read through this entire post, and you are worried about losing in-flight data due to things like a power loss event on the ZFS box, follow these rules:

Get a dedicated log device - it should be a very low-latency device, such as a STEC ZeusRAM or an SLC SSD, but even a high quality MLC SSD is better than leaving log traffic on the data vdevs (which is where they'll go without log devices in the pool). It should be at least a little larger than this formula, if you want to prevent any possible chance of overrunning the size of your slog: (maximum possible incoming write traffic in GB * seconds between transaction group commits * 3). Make it much larger if its an SSD, and much much larger if its an MLC SSD - the size will help with longevity. Oh, and seconds between transaction group commits is the ZFS tunable zfs_txg_timeout. Default in older distributions is 30 seconds, newer is 5, with even newer probably going to 10. It is worth noting that if your rarely if ever have heavy write workloads, you may not have to size it as large -- it is very preferably from a performance perspective that you not be regularly filling the slog, but if you do it rarely it's no big deal. So if your average writes in [txg_timeout * 3] is only 1 GB, then you probably only need 1 GB of log space, and just understand when you rarely overfill it there will be a performance impact for a short period of time while the heavy write load continues. [edited 11/22/2013 - also, as a note, this logic only applies on ZFS versions that still use the older write code -- newer versions will have the new write mechanics and I will update this again with info on that when I have it]
(optional but strongly preferred) Get a second dedicated log device (of the exact same type as the first), and when creating the log vdev, specify it as a mirror of the two. This will protect you from nasty edge cases.
Disable 'writeback cache' on every LU you create from a zvol, that has data you don't want to lose in-flight transactions for.
Set sync=always on the pool itself, and do not override the setting on any dataset you care about data integrity on (but feel free TO override the setting to sync=disabled on datasets where you know loss of in-transit data will be unimportant, easily recoverable, and/or not worth the cost associated with making it safe; thus freeing up I/O on your log devices to handle actually important incoming data).

Alright, on with the words.

(IPMP vs LACP) vs MPIO

If you're running an illumos or Solaris-based distribution for your ZFS needs, especially in a production environment, you may find yourself wanting to aggregate multiple network interfaces either for performance, redundancy, or both. With Solaris, your choices are not limited to standard LACP.

ZVOL Used Space

One of the more common complaints I hear about with ZFS is when clients are using zvol's to offer up iSCSI or FC block targets to clients, formatting filesystems on them, and then using them. Well, I don't get complaints about that, so much.. what I get are complaints about how, over time, the 'used' space as visible in ZFS is completely out of whack with the used space as visible in the filesystem on the client.

Nexenta - SMTP Server Settings

A common problem for users of NexentaStor, especially home users and people doing a trial evaluation of the software, is that, at least as of 3.1.3.5, it requires a real SMTP server somewhere else in order to send email. I'm often told no email was set up because no SMTP server or account was available (at home, this is often just permanently true, and in the enterprise this is often true during eval phases). This is bad - NexentaStor sends all sorts of alerts via email that it does not display anywhere else. Your appliance may be warning you of something, and you'll have no idea. There is a workaround, however, if the appliance can get to the internet and you have GMail (or can be bothered to set up an account)!

Reservation & Ref Reservation - An Explanation (Attempt)

So in this article I'm going to try to explain and answer a lot of the questions I get and misconceptions I see in terms of ZFS and space utilization of a pool. Not sure how well I'm going to do here, I've never found a holy grail way of explaining these that everyone understands - seems like one way works with one person, and a different way is necessary for the next guy, but let's give it a shot.

VM Disk Timeouts

A pretty common issue to run into when using some SAN back-ends for virtual machines is that the VM's end up crashing, BSOD'ing, or (most commonly) remounting their "disks" read-only when there's a hiccup or failover in the storage system, often resulting in a need to reboot to restore functionality.

ZFS: Read Me 1st

Things Nobody Told You About ZFS

Yes, it's back. You may also notice it is now hosted on my Blogger page - just don't have time to deal with self-hosting at the moment, but I've made sure the old URL redirects here.

So, without further adieu..

Nex7's Blog

Wednesday, October 9, 2013

Lots of L2ARC not a good idea, for now

Thursday, April 4, 2013

Cables Matter

Tuesday, April 2, 2013

ZFS Intent Log

Saturday, March 30, 2013

(IPMP vs LACP) vs MPIO

Tuesday, March 26, 2013

ZVOL Used Space

Monday, March 25, 2013

Nexenta - SMTP Server Settings

Saturday, March 23, 2013

Reservation & Ref Reservation - An Explanation (Attempt)

VM Disk Timeouts

Thursday, March 21, 2013

ZFS: Read Me 1st

Things Nobody Told You About ZFS