I had a quiet New Year's Eve and Day for the beginning of 2018. We had originally planned a trip away with my parents and some friends from southern California, but those plans all fell through -- my father was diagnosed with cancer late in 2017 and my parents' trip to visit us in the U.S. was cancelled, and our friends work in medicine and wound up being on call. One of Lou's other friends came to visit us instead: she was on a mission to experience midnight twice on January 1st by flying from Hong Kong to San Francisco. That might sound like an excuse to party hard, but instead we sat around an Ikea table playing board games, drinking wine and eating gingerbread. It was very pleasant.

On Monday (the 1st) I had the day off work for New Year's Day, as is usual in most of the western world, so I slept in late. Lou and her friend decided to go to the wax museum and see several tourist attractions around SF, and I decided to pass the day at home reading. That afternoon, work chat started talking about a Tumblr post by pythonsweetness about an Intel hardware security bug. At the time I definitely did not suspect that this was going to occupy most of my working life for the next (almost) two months.

Like many people who work on system security, I had read Anders Fogh's post about a "Negative Result" in speculative execution research in July of 2017. At the time I thought it was an interesting writeup and I remember being glad that researchers were looking into this area. I sent the post to Bryan to ask his thoughts on it, and he replied that "it would be shocking if they left a way to directly leak out memory in the speculative execution". None of us seriously thought that there would be low-hanging fruit down that research path, but we also felt it was important that there was someone doing work in the area who was committed to public disclosure.

At first, after reading the blog post on Monday, we thought (or hoped) that the bug might "just" be a KASLR bypass and wouldn't require a lot of urgency. We tried to reach out to Intel at work to get more information but were met with silence. (We wouldn't hear back from them until after the disclosure was already public.) Speculation intensified through Tuesday, until finally on Wednesday morning I arrived at the office to find links to tweets from late Tuesday night revealing exploits that allowed arbitrary kernel memory reads.

Wednesday was not a happy day. Intel finally responded to our emails -- after they had already initiated public disclosure. I found a bottle of scotch on a shelf in the corner and had a glass or two in the afternoon once the papers were out. We all spent a lot of time reading. An arbitrary kernel memory read (an info leak) is not that uncommon as far as bugs go, and for the most part such bugs tend to be fairly easy to fix. What makes the Meltdown and Spectre bugs particularly notable is that mitigating them requires a large amount of change in very deep, low-level parts of the kernel: the kind of parts where 20-year-old errata workarounds live as single-line changes that you have to be very careful not to accidentally undo; the kind of parts where, as they say, mortals fear to tread.

On Friday we saw the patches Matthew Dillon put together for DragonFlyBSD for the first time. These were the first patches for KPTI that were very straightforward to read and understand, and applied to a BSD-derived kernel that was similar to those I'm accustomed to working on.

To mitigate Meltdown (and partially one of the Spectre variants), you have to make sure that speculative execution cannot reach any sensitive data from a user context. This basically means that the pages the kernel uses for anything potentially sensitive have to be unmapped when we are running user code. Traditionally, CPUs that were built to run a multi-user, UNIX-like OS did this by default (SPARC is an example of such a CPU which has completely separate address spaces for the kernel and userland). However, x86 descends from a single-address-space microcontroller that has grown up avoiding backwards-incompatible changes, and has never really introduced a clean notion of multiple address spaces (segmentation is the closest feature really, and it was thrown out for 64-bit AMD64). Instead, operating systems for x86 have generally wound up (at least in the post-AMD64 era) with flat address space models where the kernel text and data is always present in the page table no matter whether you're in user or kernel mode. The kernel mappings simply have the "supervisor" bit set on them so that user code can't directly access them.
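To make the flat model concrete, here's a rough sketch of the relevant x86-64 page table entry bits (the names here are illustrative, not the actual illumos definitions):

#define PTE_P   (1UL << 0)   /* present */
#define PTE_W   (1UL << 1)   /* writable */
#define PTE_US  (1UL << 2)   /* user/supervisor: clear = kernel-only */

/* A kernel mapping in the traditional flat model (PTE_US clear): */
uint64_t kern_pte = paddr | PTE_P | PTE_W;

/* A userland mapping in the very same page table: */
uint64_t user_pte = paddr | PTE_P | PTE_W | PTE_US;

The crux of Meltdown is that this supervisor bit is not reliably honoured during speculative execution on affected CPUs.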

The mitigation is basically to stop doing this: to stop mapping the kernel text, data and other memory into the page table while we're running in userland. Unfortunately, the x86 design does not make this easy. In order to be able to take interrupts or traps, the CPU has to have a number of structures mapped in the current page table at all times (the IDT, GDT and TSS, as well as any LDTs if you're using those). There is also no way to tell an x86 CPU to switch page tables when an interrupt occurs. So the code that we jump to when we take an interrupt, as well as space for a stack to push context onto, have to be available in both page tables. And finally, of course, when we enter the kernel we need some way to figure out which page table to switch to.
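As a sketch of what that implies, the dual-mapped per-CPU area has to hold at least something like the following (hypothetical names and layout, not the actual illumos structures):

struct kpti_state {
    uint64_t ks_kern_cr3;        /* page table to load on kernel entry */
    uint64_t ks_user_cr3;        /* page table to restore on return */
    uint64_t ks_tramp_stack[64]; /* small stack to push context onto */
};

Everything in here, plus the trampoline code itself, has to be mapped in both the user and kernel page tables.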

When we looked at the patches for Linux (and also the DragonFlyBSD patches at the time) on Friday and started asking questions, it became pretty evident that the initial work from both had been done under time constraints. Both had left the full kernel text mapped in both page tables, and the Linux trampoline design seemed over-complex. I started talking over some ideas with Robert Mustacchi about ways to fix these issues and about who we should talk to, and reached out to some of my old workmates from UQ who were involved with OpenBSD. It seemed to me that the OpenBSD developers would care about these issues even more than we did, and would want to work out how to do the mitigation right.

I ended up sending an email to Philip Guenther on Friday afternoon, and on Saturday morning I drove an hour or so to meet up with him for coffee to talk page tables and interrupt trampolines. We wound up spending a good 6 hours at the coffee shop, and I came back with several pages of notes and a half-decent idea of the shape of the work to come.

One detail we missed that day was the interaction of per-CPU structures with per-process page tables. Much of the interrupt trampoline work is most easily done by using per-CPU structures in memory (and you definitely want a per-CPU stack!). If you combine that with per-process page tables, however, you have a problem: if you leave all the per-CPU areas mapped in all the processes, you will leak information (via Meltdown) about the state of one process to a different one when taking interrupts. In particular, you will leak things like %rip, which ruins all the work being done with PIE and ASLR pretty quickly. So, there are two options: you can either allocate the per-CPU structures per-process (so you end up with $NCPUS * $NPROCS of them); or you can make the page tables per-CPU.

OpenBSD, like Linux and the other implementations so far, decided to go down the road of per-CPU per-process pages to solve this issue. For illumos, we took the other route.

In illumos, it turned out that we already had per-CPU page tables. Robert and I re-discovered this on the Sunday of that week, sitting around the same Ikea table I had played board games on the week before for New Year's. We use them for 32-bit processes because our kernel has full P>V PAE support (which is, as it turns out, relatively uncommon amongst open-source OSes). The logic for creating, managing and updating these page tables was already written, and after reading the code we concluded we could make a few small changes and re-use all of it. So we did.

By the end of that second week, we had a prototype that could get to userland. But when working on this kind of kernel change, we have a rule of thumb: after the first 70% of the patch is done and we can boot again, it's time for the second 70%. In fact, it turned out to be more like the second 200% for us -- a tedious long tail of bugs to solve, which ended up necessitating some changes in the design as well.

At first we borrowed the method that Matt Dillon used for DragonFlyBSD, by putting the temporary "stack" space and state data for the interrupt trampolines into an extra page tacked onto the end of *%gs (in illumos the structure that lives there is the cpu_t). We would check for kernel %cs and then swapgs when entering the trampolines and use %gs-relative addressing to find the structure members.

If you read the existing logic in interrupt handlers for dealing with %gs, though, you will quickly notice that the corner cases start to build up. There are a bunch of situations where the kernel temporarily alters %gs, and some of the ways to mess it up have security consequences worse than the bug we're trying to fix. As it turns out, there are no fewer than 3 different ways that ISRs on illumos use to try to get the right cpu_t into %gs, and they are all subtly different. Telling which one you should use when requires a bunch of test logic that in turn requires branches and changes to CPU state, which is difficult to write in a trampoline where you're trying to avoid altering that state as much as possible until you've got the real stack online to push things onto.

I kept in touch with Philip Guenther and Mike Larkin from the OpenBSD project throughout the weeks that followed. In one of our discussions, we talked about the NMI/MCE handlers and the fact that OpenBSD's current handling of them neglected some nasty corner cases around interrupting an existing trap handler. A big part of the solution to those issues was to use a feature called IST, which allows you to unconditionally change stacks when you take an interrupt.

Traditionally, x86 only changes the stack pointer (%rsp on AMD64) while taking an interrupt when there is a privilege level change. If you take an interrupt while already in the kernel, the CPU does not change the stack pointer, and simply pushes the interrupt stack frame onto the stack you're already using. IST makes the change of stack pointer unconditional. Used unwisely, this is a bad idea: if you stay on that stack and turn interrupts back on, you could take another interrupt and clobber the frame you're already in. However, in IST I saw a possible way to simplify the KPTI trampoline logic and avoid having to deal with %gs.
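For reference, the IST index lives in the 64-bit IDT gate descriptor. The layout below follows the architecture manuals; the field names are mine:

struct idt_gate64 {
    uint16_t ig_off_low;    /* handler address, bits 0-15 */
    uint16_t ig_selector;   /* kernel code segment selector */
    uint8_t  ig_ist;        /* bits 0-2: IST index; 0 = legacy behaviour */
    uint8_t  ig_attr;       /* gate type, DPL, present bit */
    uint16_t ig_off_mid;    /* handler address, bits 16-31 */
    uint32_t ig_off_high;   /* handler address, bits 32-63 */
    uint32_t ig_reserved;
};

A non-zero ig_ist tells the CPU to unconditionally load %rsp from the corresponding slot in the current TSS whenever it delivers that vector, privilege level change or not.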

A few weeks into the project, John Levon joined us at work. He had previously worked on a bunch of Xen-related stuff, as well as on other parts of the kernel very close to where we were working, so he quickly got up to speed with the KPTI work as well. He and I drafted out a "crazy idea" on the whiteboard one afternoon: use IST for all interrupts on the system, and put the "stack" they use in the KPTI page at the end of the cpu_t. The trampolines could then use stack-relative addresses to find the page table to switch to, pivot onto the real kernel stack memory, and throw away (almost) all the conditional logic. A few days later, we had convinced each other that this was the way to go.

Two of the most annoying x86 issues we had to work around were related to the SYSENTER instruction. This instruction is used to make "fast" system calls in 32-bit userland. It has a couple of unfortunate properties: firstly, it doesn't save or restore RFLAGS, so the kernel code has to take care of this (and be very careful not to clobber any of it before saving or after restoring it). Secondly, if you execute SYSENTER with the TF ("trap"/single-step flag) set by a debugger, the resulting debug trap's frame points at kernel code instead of the user code where it actually happened. The first one requires some careful gymnastics on the entry and return trampolines specifically for SYSENTER, while the second is a nasty case that is incidentally made easier by using IST. With IST, we can simply make the debug trap trampoline check for whether we took the trap in another trampoline's code, and reset %cr3 and the destination stack. This works for single-stepping into any of the handlers, not just the one for SYSENTER.
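A sketch of that check, written as C for readability even though the real trampoline is assembly (and reusing the hypothetical kpti_state names from earlier):

/* In the debug (#DB) trap trampoline: */
if (frame->rip >= (uintptr_t)tramp_start &&
    frame->rip < (uintptr_t)tramp_end) {
    /*
     * We interrupted another trampoline: get back onto the
     * kernel page table before doing anything else...
     */
    setcr3(ks->ks_kern_cr3);
    /* ...then move the frame onto the real kernel stack and carry on. */
}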

To make debugging easier, we decided that traps like the debug/single-step trap (as well as faults like page faults, #GP, etc.) would push their interrupt frame in a different part of the KPTI state page than normal interrupts use. We applied this change to all the traps that can interrupt another trampoline (based on the instructions we used). These "paranoid" traps also set a flag in the KPTI struct to mark it busy (and jump to the double-fault handler if it already is), to work around some bugs where double-faults are not correctly generated.
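In the same illustrative C-for-assembly style, the busy flag works roughly like this (extending the hypothetical kpti_state with a ks_flags member):

/* On entry to a "paranoid" trap trampoline: */
if (ks->ks_flags & KS_BUSY)
    goto treat_as_double_fault;  /* we interrupted ourselves */
ks->ks_flags |= KS_BUSY;
/* ...push the frame into the paranoid area and pivot stacks... */
ks->ks_flags &= ~KS_BUSY;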

It's been a long and busy two months, with lots of time spent building, testing, and validating the code. We've run it on as many kinds of machines as we could get our hands on, to try to make sure we catch issues. The time we've spent on this has been vindicated several times over by the discovery of bugs that could have been nasty in production.

One great example: our patches were causing busy Westmere-EP Xeon machines to throw a lot of L0 I-cache parity errors. This seemed very mysterious at first, and it took seeing it a few times before we believed it was actually our fault. It turned out to be caused by accidentally triggering a CPU erratum for Westmere (B52, "Memory Aliasing of Code Pages May Cause Unpredictable System Behaviour") -- we had made a typo and put the "cacheable" flag into a variable named flags instead of attrs, where it belonged, when setting up the page tables. This caused performance degradation on other machines, but on Westmere it caused cache parity errors as well. It's a great example of the surprising consequences that small mistakes in this kind of code can have. In the end, I'm glad that that erratum existed; otherwise it may have been a long time before we caught the bug.
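Reconstructed from memory rather than quoted from the actual diff (PT_CACHEABLE is an illustrative name), the mistake looked something like this:

/* What we wrote when building the kernel's page table entries: */
flags |= PT_CACHEABLE;   /* the typo */

/* What we meant: */
attrs |= PT_CACHEABLE;

One misplaced word was enough for code pages to end up mapped with the wrong cacheability attributes.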

As of this week, Mike and Philip have committed the OpenBSD patches for KPTI to their repository, and the patches for illumos are out for review. It's a nice kind of symmetry that the two projects that started on the work together after the public disclosure are both almost ready to ship at the same time at the other end. I'm feeling hopeful, and looking forward to more collaborations like this with our cousins, the BSDs.

One of the more technically challenging parts of the RFD77 design is its layers of defences against side-channel attacks. This is particularly relevant to the soft-token component, which is expected to perform cryptographic operations on the same physical hosts that run untrusted customer workloads, using key material that those customer workloads must never be able to extract. A quick browse of the literature on timing, cache, and DRAM side-channels, and other kinds of information leakage between processes on the same CPU, should convince anybody that this is far from an easy task.

The first layer of defence, of course, is to make algorithm and code choices that prevent or reduce leaks at the source. To that end, the soft-token makes extensive use, wherever possible, of Ed25519, ChaCha20 and other algorithms conceived and designed expressly with common side-channel attacks in mind. We also made the call to ban the use of ECDSA on the host as part of the soft-token infrastructure, given that the track record of ECDSA implementations with respect to side-channel leakage has been vastly worse than that of other public-key algorithm families. At the points where we still use RSA today, we have a plan for phasing it out (in favour of Ed25519), and we are careful to use state-of-the-art implementations that are kept up to date, to try to stay ahead of the problems.

As a second line of defence, aimed at the well-known Flush+Reload, Flush+Flush and other cache side-channel techniques that rely on (or are at least vastly easier with) shared memory pages, we eliminate the possibility of sharing physical page frames with the soft-token code. The illumos operating system has no support for Kernel Same-Page Merging or similar features, so to do this we use a combination of statically linking the cryptographic code into the soft-token binary and a technique that changes our code mappings into private duplicated pages.

The plan is to add this functionality to the illumos linker before shipping RFD77, but for now we use a hack:

#include <sys/types.h>
#include <sys/mman.h>
#include <sys/debug.h>  /* VERIFY0(), VERIFY() */
#include <dlfcn.h>
#include <link.h>
#include <unistd.h>
#include <stdlib.h>
#include <strings.h>

static void
unshare_code(void)
{
    Dl_mapinfo_t mi;
    uint cnt;
    volatile char *ptr, *base, *limit;
    size_t sz;
    char tmp;
    intptr_t pgsz = sysconf(_SC_PAGE_SIZE);
    intptr_t pgmask = ~(pgsz - 1);

    bzero(&mi, sizeof (mi));

    /* Retrieve the list of mappings for our own object */
    VERIFY0(dlinfo(RTLD_SELF, RTLD_DI_MMAPCNT, &mi.dlm_acnt));
    mi.dlm_maps = calloc(mi.dlm_acnt, sizeof (mmapobj_result_t));
    VERIFY(mi.dlm_maps != NULL);
    VERIFY0(dlinfo(RTLD_SELF, RTLD_DI_MMAPS, &mi));

    for (cnt = 0; cnt < mi.dlm_rcnt; ++cnt) {

        /* We only care about executable mappings for now. */
        if ((mi.dlm_maps[cnt].mr_prot & PROT_EXEC) == PROT_EXEC) {
            ptr = mi.dlm_maps[cnt].mr_addr;
            sz = mi.dlm_maps[cnt].mr_msize;
            limit = ptr + sz;
            base = (volatile char *)((intptr_t)ptr & pgmask);

            /* Keep PROT_EXEC: this may be the mapping we're running from. */
            VERIFY0(mprotect((caddr_t)base, (size_t)(limit - base),
                PROT_READ | PROT_WRITE | PROT_EXEC));

            /*
             * Touch every page in the mapping to make it private
             * to this process.
             */
            for (; ptr < limit; ptr += pgsz) {
                tmp = *ptr;
                *ptr = tmp;
            }

            VERIFY0(mprotect((caddr_t)base, (size_t)(limit - base),
                mi.dlm_maps[cnt].mr_prot));
        }
    }

    free(mi.dlm_maps);
}

This code uses the illumos linker introspection facilities (in the form of dlinfo(3c)) to find the mappings for the statically linked binary, then walks through them, changes each temporarily to R|W|X protection (we can't use R|W, sadly, because this might be the very mapping containing the code we're running right now!), touches one byte in every page, then changes the mapping back to its original state.

This results in the kernel performing COW on all of the pages containing our code, giving us our own private copies of each. Now when we execute from these pages, we don't leak information as easily to other processes that might map the same binary out of the filesystem cache (since they will get the pages we originally had, not the new ones we have after COW). We can also run this function every time we fork() to isolate ourselves from our child processes, which is quite handy. This technique isn't perfect by any means, but it does make quite a few of these well-known attacks much harder.
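Hypothetical usage at fork time (a sketch, not the actual soft-token code):

pid_t kid = fork();
VERIFY3S(kid, >=, 0);
if (kid == 0) {
    /*
     * Child: COW our code pages again, so we stop sharing
     * them with the parent from this point on.
     */
    unshare_code();
    /* ...drop privileges and continue with child setup... */
}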

We also make certain that the binaries running as part of the soft-token are not visible in the filesystem of any non-global zone.

As a third layer of defence against CPU-cache-centric side-channel attacks, the RFD77 design proposes the use of Intel CAT (Cache Allocation Technology). This mitigation is aimed squarely at attack families such as Prime+Probe, which do not require shared pages with the target. In the CATalyst paper, Fangfei Liu et al. show that Intel CAT can be used to divide the CPU cache into sub-segments and "pin" pages into one so they cannot be pushed out. We intend to use an implementation of their technique for this third mitigation measure. We haven't implemented this functionality yet, though we have made a careful study of the Intel documentation on the feature and looked at how it will fit into the kernel. The current plan is to set aside some of the system's CPU cache for the exclusive use of the soft-token.
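Since this is unimplemented, the following is only a rough sketch of the mechanism as the Intel documentation describes it, with wrmsr() standing in for the kernel's MSR accessor and a 12-way L3 cache assumed:

#define IA32_PQR_ASSOC      0xc8f         /* selects the active CLOS */
#define IA32_L3_QOS_MASK(n) (0xc90 + (n)) /* way bitmask for CLOS n */

/* Carve ways 0-1 of the L3 out of the default class... */
wrmsr(IA32_L3_QOS_MASK(0), 0xffc);  /* default CLOS: ways 2-11 only */
wrmsr(IA32_L3_QOS_MASK(1), 0x003);  /* soft-token CLOS: ways 0-1 */

/* ...and switch onto the soft-token CLOS when running its code. */
wrmsr(IA32_PQR_ASSOC, (uint64_t)1 << 32);  /* CLOS lives in bits 63:32 */

The pinning part of the CATalyst technique then comes, roughly, from touching the sensitive pages while running in the dedicated CLOS, so that they become resident in the reserved ways and nothing outside that CLOS can evict them.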

As the small number of pages containing the critical cryptographic code and data will be pinned into the dedicated CPU cache segment and cannot leave it, this method should also provide some protection against the DRAM buffer side-channel discussed by Michael Schwarz et al. in their "Malware Guard Extension" paper. This will come at a high performance cost for the soft-token code -- however, based on our analysis and benchmarks so far, we still believe this will be far more cost-effective than buying dedicated high-performance hardware cryptographic modules for every machine in a large cloud installation.

Finally, a note on something that is not a cache side-channel defence mechanism: Intel SGX. We do plan to use SGX to make harvesting keys in bulk very difficult even for an attacker who has gained control of the hypervisor kernel, but it does not constitute any kind of serious defence against side-channel attacks in and of itself (see e.g. the discussion in Schwarz et al.'s paper). We are not planning to expose SGX to end-users on SmartOS or SDC anytime soon.

One of the nicer aspects of the RFD77 design, which makes me a little less scared about attacks like these, is that we are not seeking to defend the soft-token entirely against a compromised kernel: we only seek to make defeating it expensive enough that going to the hardware PIV module would be just as fast. We are also in the position of being able to alter the operating system itself very widely to support our goals. Timing side-channels on a shared CPU are still vaguely terrifying when working at a company built around sharing CPUs between users, but with these plans in place I'm happy to say that I don't think this will be the weakest link in the security of this part of the system.

The first stage of the RFD77 soft-token was fairly straightforward, all things considered: copy-pasting the agent.c code from OpenSSH; making a forking libsysevent(3LIB) subscriber that watched for zones coming and going and forked off an agent for each; and writing a PIV client to sit in the middle and decrypt the keys with a hardware token. When the agent code wanted to use a key, it would send a simple fixed-length request struct on a pipe up to the PIV client, which would decrypt the key, write it into a shared memory segment, and reply back down the pipe. This meant no locks between processes and no variable-length data being exchanged, which makes writing a robust implementation a lot simpler.
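As an illustrative sketch (not the actual RFD77 source), the message on the pipe is just a small fixed-size struct:

struct token_request {
    uint8_t tr_opcode;    /* e.g. REQ_DECRYPT_KEY, REQ_RENEW_CERT */
    uint8_t tr_slot_id;   /* which key slot the request concerns */
    uint8_t tr_pad[6];    /* fixed 8-byte size, no variable data */
};

Because every message is the same small fixed size, a read() on the pipe returns either a whole request or nothing; there is no framing or reassembly to get wrong.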

Certificates, however, involve a fair bit more back-and-forth. They need to be periodically renewed, and in the case of X.509 certificates, they have a potentially variable-length chain back to their CA. Parsing and generating them is also vastly more complex, involving a lot of ASN.1 and complex structures.

My initial strawman design was to handle renewal entirely in the "supervisor" (PIV) layer and have the agent just shovel whatever cert chain was loaded into a second shared memory segment out to clients. This quickly runs into a problem, however: if the supervisor updates the shared memory whenever it likes, there needs to be a lock so that the agent doesn't read partially updated data, and interprocess locking is somewhat easy to mess up (and up to this point I had successfully avoided it in the design).

So, instead, I decided to make certificate renewal happen only at the behest of the agent process. It sends another fixed-length request to the supervisor asking for the cert chain for a given key to be renewed, and waits for a reply from the supervisor before going to use the shared memory segment again.

The shared memory segments each look like this:

struct token_slot_data {
    volatile uint32_t tsd_len;   /* length of the valid data that follows */
    volatile char tsd_data[1];   /* data follows, up to the segment size */
};

A simple length-prefixed data segment. Every time we use it, we VERIFY that tsd_len fits within the bounds of the memory region before doing any work. Also, the supervisor layer is carefully written to never ever read any of the fields in the shared memory segment: it only ever blindly overwrites them. This enforces the trust hierarchy between the supervisor and the agent child (which runs with very few privileges). The token_slot_data structs are pointed to by the struct token_slot:

struct token_slot {
    uint8_t ts_id;
    enum slot_type ts_type;
    enum slot_algo ts_algo;
    const char *ts_name;
    struct token_slot *ts_next;
    struct token_slot_data *ts_data;
    struct token_slot_data *ts_certdata;
    struct token_slot_data *ts_chaindata;
    size_t ts_datasize;
    struct sshkey *ts_public;
    nvlist_t *ts_nvl;
    struct agent_slot *ts_agent;
};

As you can see, we have 3 data segments associated with each "slot" -- one for the key itself (ts_data), one for the direct certificate (ts_certdata) and one for the certificate chain (ts_chaindata).

The certificate and chain shared segments have a fixed maximum length: MAX_CERT_LEN (8 KB) and MAX_CHAIN_LEN (16 KB) respectively. Given that most real certificates are under about 2 KB each and we expect a maximum chain length of about 4 certificates, this seems quite reasonable. The pages allocated for these segments don't have the guard pages at the start and end of the mapping -- since they don't contain any private key material, it's not as important to protect them from memory overread.
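The VERIFY mentioned earlier then looks roughly like this (a sketch using the names from this post):

/* Before reading a certificate out of the shared segment: */
struct token_slot_data *d = slot->ts_certdata;
VERIFY3U(d->tsd_len, <=, MAX_CERT_LEN - sizeof (uint32_t));
/* Only now is it safe to touch d->tsd_data[0 .. tsd_len - 1]. */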

Currently the implementation just assumes that the chain segment contains exactly one certificate. This works fine for the simple case where our cert is signed directly by our own host node only and the chain goes no further. (In the zone soft-token we have the cert.key ou=instances cert in the cert slot, and then the ou=nodes cert for the global zone soft-token cert.key in the chain slot. In the global soft-token the chain slot is empty.)

The elephant in the room, then, is how to expand this approach to deal with chains back to a head node from an ordinary CN in Triton. Currently my thoughts on this centre around making a "CertAPI" headnode service that the CNs can talk to to request signing of their certificates. The client that talks to it from the CN global zone, though, and how to integrate that into the soft-token, are still open questions.

One option I've considered is making the global soft-token agent code support writing a new chain through the normal socket interface. Then a secondary process (perhaps written in node) can sit off to the side to manage the conversation with CertAPI, obtaining the new chain and uploading it into the soft-token. This has the notable advantage of not requiring much complex code to run inside the soft-token process, which is something I like. It does, however, make the cert renewal process partly push-based and partly pull-based, which seems prone to racy errors in implementation. Another option I've considered is making another fork of the soft-token process for this communication, with a shared memory segment to write into like the agent process has. Then it could still be pull-based: the PIV supervisor requests the renewal from the CertAPI client process when the agent asks for a cert renewal itself.

In any case, I'm going to continue prototyping and testing with certificates directly on each node for the time being, and start work on the gossip service. Chaining certificates back to the headnode from a CN is an end-user feature and not required for the internal use of RFD77 by Triton headnode services, so there's plenty of work that can be done before having to come back and solve this problem.