How Does Tinfoil Compare to Apple Private Cloud Compute?
In a previous blog post, we introduced Tinfoil enclaves and explained how they serve as building blocks for our platform. We also explained how Tinfoil enables the deployment of private AI applications with verifiable security. However, the many technical details and moving parts can make it difficult to understand how Tinfoil differs from existing solutions for confidential AI.
Trusted execution environments (TEEs) are at the core of Tinfoil. TEEs are low-level hardware isolation primitives that have been around for roughly two decades. Building on top of these mechanisms, many different types of abstractions have been proposed to interface with the isolated software.
Secure enclave platforms usually aim to provide a bare-metal environment to run a single process. Confidential computing and trusted-VM solutions, on the other hand, aim to isolate entire virtual machines. Software infrastructure can also repurpose one of these mechanisms to expose a different abstraction.
This creates an interesting design space, with platforms such as Azure Confidential Computing, AWS Nitro Enclaves, and the recent Apple Private Cloud Compute. These offerings differ primarily along two axes: security and level of abstraction.
So back to the question: “How is Tinfoil different?” The short answer is that we focus on a specific application, private AI, which makes it possible to get the best of both worlds. We can build on the strongest hardware-based isolation primitives while making it easy to use safely via a clean abstraction tailored specifically to private AI workloads.
Security: Reducing Trust
Trust can be evaluated in multiple ways: the number of trusted parties, the size of the trusted code base, or the types of attacks considered. This means that different solutions are not strictly ordered. Nevertheless, we can roughly compare them to one another.
But first, let’s pop up a level and consider the setting that Tinfoil operates in. Tinfoil is a cloud computing setup, which means we have some guest software (a function, a process, or an entire virtual machine) that is running on a host machine. This host machine includes all the privileged software to orchestrate, isolate, and manage the guest software.
A Brief History of TEEs
Trusted execution environments have a long history, dating back twenty years with several academic proposals1 and the development of the Trusted Platform Module (TPM). The end goal was always the same: to enforce the privacy and integrity of a piece of software, even in the presence of an “attacker” controlling the host. In this threat model, the attacker controls the entire privileged software stack (operating systems, hypervisor, etc.) as well as the host machine itself. This is an extremely strong threat model, unmatched by any other isolation primitive. For instance, process isolation or virtualization protects a workload from other processes or VMs, but not from the host. Sandboxes or classic containers add extra isolation to protect the host from untrusted guests. TEEs are a step up from this: they protect the guest from a malicious host.2
Hypervisor-Based TEEs
At first, processors did not contain hardware mechanisms to completely isolate guest software. Instead, the first generation of TEEs relied on a small but trusted hypervisor (not controlled by the host) to enforce isolation between the host and guests.3 This light hypervisor is only expected to perform checks on security-critical operations and delegates all complex logic (e.g., memory-page management) to a full-fledged host-controlled operating system. This is the underlying architecture used by Apple Private Cloud Compute and AWS Nitro Enclaves.
An important assumption here is that the mini-hypervisor is trusted. Apple and AWS probably have the resources to tailor and harden these key pieces of software. Nevertheless, at the time of writing, while Apple offers the option to inspect the corresponding binaries, both projects remain closed source.4
Despite this lack of code transparency, the Nitro hypervisor has a pristine security track record. The AWS team is known for leveraging a modular architecture that enables formal verification at scale. Moreover, the Nitro hypervisor also enforces isolation between EC2 instances and the rest of the AWS infrastructure. For these reasons, we picked Nitro Enclaves as the underlying TEE for our lightweight Tinfoil Dev environment.
Nevertheless, these existing solutions share some limitations: guest memory is not encrypted; Apple's and AWS's trust models rely on closed ecosystems where trust cannot be easily distributed; and although Apple provides a form of remote attestation, code transparency is limited and no mechanism exists to match open source code with what the end client sees.
However, when it comes to Tinfoil Prod, we want stronger guarantees. In contrast to these existing solutions, Tinfoil provides client-side verifiability against the open source code and encrypts all memory at all times.
Memory-Encrypted TEEs
Starting in 2015 with Intel SGX, we saw the introduction of TEE mechanisms fully implemented in hardware which make it possible to reduce the amount of trusted code. These hardware mechanisms add an extra layer of security by encrypting the guest memory, so it is not accessible to the (now untrusted) hypervisor, nor to a physical attacker that could access the DRAM.5 This protects against entire new classes of attacks. While no system is ever perfectly secure,6 hardware TEEs provide the strongest level of guest isolation and security available in cloud computing today.7
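The client-side verifiability these hardware TEEs enable rests on a simple idea: the hardware produces a signed measurement (a hash) of the code loaded into the enclave, and the client compares it against the hash of the code it has audited. Here is a toy Python sketch of just that comparison step; the report structure and code strings are invented for illustration, and the hardware signature check (which a real verifier performs first, against AMD's or Intel's key chain) is omitted:

```python
import hashlib

def measure(code: bytes) -> str:
    """Toy stand-in for the launch measurement a TEE computes over guest code."""
    return hashlib.sha384(code).hexdigest()  # SEV-SNP measurements use SHA-384

def verify_report(report: dict, expected_code: bytes) -> bool:
    """Client-side check: does the attested measurement match the audited code?

    A real verifier would first validate the hardware signature chain on the
    report; that step is omitted in this sketch.
    """
    return report["measurement"] == measure(expected_code)

# Simulated flow: an honest enclave reports a measurement of what it booted.
audited_code = b"open-source inference server (placeholder build artifact)"
report = {"measurement": measure(audited_code)}
assert verify_report(report, audited_code)

# A tampered guest produces a different measurement and is rejected.
tampered = {"measurement": measure(b"backdoored server")}
assert not verify_report(tampered, audited_code)
```

The key point is that the comparison happens on the client: trust is anchored in the hardware's signature and in a reproducible build of the open source code, not in the cloud provider's word.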
NVIDIA's latest GPUs, the Hopper and Blackwell architectures, are equipped with confidential computing capabilities. This means they can be configured to extend a TEE with computational resources sufficient to run the largest AI models with negligible performance overhead. GPU memory is encrypted, as is the connection with the TEE.8
Tinfoil Dev vs Tinfoil Prod
Our Tinfoil Prod platform uses a combination of hardware-based TEEs (AMD SEV-SNP and Intel TDX) and NVIDIA GPUs with confidential computing enabled. It removes any trust assumption in a hypervisor and encrypts guest memory. As a result, Tinfoil Prod offers stronger security guarantees than Tinfoil Dev, delivering the best guest isolation available in the cloud.
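When a CPU TEE is extended with a confidential GPU, a verifier must check both attestations before trusting the platform: a matching CPU measurement alone says nothing about the GPU it is connected to. A minimal sketch of that composite check (the measurement values and report keys are placeholders, not any real report format):

```python
import hashlib

def h(data: bytes) -> str:
    return hashlib.sha384(data).hexdigest()

# Measurements the client pins ahead of time (hypothetical values derived
# here from placeholder build artifacts).
EXPECTED = {
    "cpu_tee": h(b"guest VM image"),        # AMD SEV-SNP / Intel TDX measurement
    "gpu_cc":  h(b"GPU firmware + config"), # NVIDIA confidential-computing report
}

def verify_platform(reports: dict) -> bool:
    """Trust the session only if every component attests as expected."""
    return all(reports.get(k) == v for k, v in EXPECTED.items())

# Both components attest correctly: the platform is accepted.
assert verify_platform({"cpu_tee": EXPECTED["cpu_tee"],
                        "gpu_cc": EXPECTED["gpu_cc"]})

# A matching CPU report alone is not enough; the GPU must attest too.
assert not verify_platform({"cpu_tee": EXPECTED["cpu_tee"]})
```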
Usability: Finding the Right Abstraction
So why aren't TEEs used everywhere?
As is often the case in complex systems, the exposed abstraction is often more important than the underlying technology.9
First, hardware enclaves like SGX only offered a bare-metal environment for developers. No syscalls, no runtime, no libc! This made SGX difficult to use, and despite efforts to improve the developer experience through better OS support,10 it failed to achieve widespread adoption. Other abstractions saw more success, such as trusted VMs or confidential computing as provided by Azure and other major clouds. Here the abstraction matches normal VMs, giving customers SSH access to a guest VM with an extra layer of protection against the host machine and the cloud provider. Nevertheless, when looking at deploying private AI applications where only the end user should ever have access to their data, this abstraction is somewhat of a mismatch. Application providers have full access to the guest, which makes it difficult for them to manage their application while guaranteeing they cannot access their users' data.
Finally, AWS Nitro Enclaves offer a third abstraction: enclaves are packaged as containers and can be launched from a parent instance without providing any access inside the guest. This would make them a good fit for our application, were it not for the fact that Nitro Enclaves cannot leverage GPUs.
Private-AI as a Service
With Tinfoil, we provide a new abstraction tailored to deploying AI applications: developers simply select a model, add some logic to connect it to generic (but privacy-preserving) capabilities such as memory, evaluation, or telemetry, and then deploy their AI service with the guarantee that the Tinfoil endpoints are private and stateless, and that only the end user will ever have access to their personal data. We believe this service model is to AI applications what function-as-a-service has been to cloud computing: a native abstraction that directly fits the use case and simplifies many complex problems, starting, in our case, with user privacy.
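From the end user's side, this abstraction boils down to a verify-before-send flow: the client checks the endpoint's attestation and refuses to transmit any data unless it matches the audited code. The sketch below illustrates that flow; the class name, attestation format, and measurement are all invented for illustration and do not reflect any real Tinfoil API:

```python
import hashlib
import secrets

# Hypothetical pinned measurement of the audited model server.
EXPECTED_MEASUREMENT = hashlib.sha256(b"audited model server").hexdigest()

class PrivateAIClient:
    """Toy client: attestation is checked before any user data leaves the device."""

    def __init__(self, attested_measurement: str):
        if attested_measurement != EXPECTED_MEASUREMENT:
            raise ValueError("attestation failed: refusing to send data")
        # Only after verification is a session key established, standing in
        # for an encrypted channel that terminates inside the enclave.
        self.session_key = secrets.token_bytes(32)

    def chat(self, prompt: str) -> str:
        # The endpoint is stateless: nothing about the prompt is persisted.
        return f"response to {len(prompt.encode())}-byte prompt"

# A verified endpoint accepts requests.
client = PrivateAIClient(EXPECTED_MEASUREMENT)
assert client.chat("hello").startswith("response")

# An endpoint with an unexpected measurement is rejected up front.
try:
    PrivateAIClient("deadbeef")
except ValueError:
    pass  # no data was sent to the untrusted endpoint
```

The design choice this illustrates: the privacy guarantee is enforced on the client, before any data is transmitted, rather than promised by the server after the fact.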
How Is Tinfoil Different?
In summary, we’ve developed an entire platform and AI stack to make it easier than ever to use trusted execution environments with the “right” abstraction to deploy private AI applications. We do so by leveraging hardware-based TEEs that guarantee the strongest levels of guest isolation available in the cloud. Beyond isolation, we use a series of mechanisms to reduce trust and make it possible for the end-users to verify these security guarantees on the fly.11
Footnotes
1. Early academic proposals include XOM and Aegis. The TPM was the first widely available piece of hardware to commoditize trusted computing.
2. TEEs also provide code-integrity guarantees in addition to isolation, but for the rest of this blog post, we’ll primarily focus on the isolation guarantees.
3. This isolation between trust domains is enforced using separate page tables (hence separate memory spaces) and other virtualization mechanisms, sometimes implemented in hardware for performance purposes.
4. To the best of our knowledge, Apple publishes some hashes and binaries but provides no source code for them, with the exception of some small security modules.
5. In general, physical attacks are among the hardest to mount (they require physical access, after all), with the exception of cold boot attacks. These make it possible for an attacker to read out the contents of DRAM by cooling it with a freeze spray and plugging it into a different machine under the attacker’s control.
6. Some advanced microarchitectural attacks are still possible (as on any other cloud platform), but these do not represent the main privacy concern for running AI workloads in the cloud. We will detail microarchitectural side channels and their implications in a future blog post.
7. Cryptographic solutions such as multiparty computation or fully homomorphic encryption offer stronger security based on mathematical assumptions. Nevertheless, they incur significant (10^4 to 10^10×) performance overhead, making them impractical for large AI workloads.
8. Remote attestation also makes it possible to confirm the authenticity and correct operation of the NVIDIA GPU.
9. In the late 60s and early 70s, computer hardware and systems were growing in complexity and delivering more power and capacity than ever before. Nevertheless, the lack of a correct abstraction over these systems led to the Software Crisis. This was addressed in part by the invention of object-oriented programming, which gave a clear abstraction tying data and programs together.
10. Gramine (previously Graphene), a library OS, aimed at making off-the-shelf applications run inside SGX.
11. See our detailed blog post on these mechanisms and how we manage trust.