A Brief Look at the Canary Mechanism in Linux Programs

2022-08-25

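I. Introduction

I have long been curious about how the canary is implemented on Linux, but never had the motivation to dig into the details. That curiosity peaked when I learned that a canary can be tampered with by modifying a child thread's thread-local storage, so I decided to study it properly.

It has been too long since my last blog post, so this is just a short write-up.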

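II. What Is a Canary

A canary is a stack protection mechanism that detects, when a function returns, whether the current stack frame has been corrupted. When a function call sets up a new stack frame, the compiler-inserted prologue places a random value at the bottom of that frame (between the local variables and the saved frame pointer and return address), and the epilogue checks that the value is unchanged before the function returns. If the value has been altered, a stack overflow has occurred and the program exits.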

Interestingly, to make it harder for printf and other string-output functions to leak the canary, its lowest byte is always \x00.

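When canary verification fails, the compiler-generated code calls __stack_chk_fail. For user-space programs this function is implemented in glibc; it prints a message and terminates the program. Because the message includes the program path taken from argv[0], an attacker who controls the overflow length can overwrite the argv[0] pointer near the bottom of the stack and use the __stack_chk_fail report to leak information.

Note that canaries are also used in the Linux kernel. If a stack overflow is detected while kernel code is running, control reaches the kernel's own __stack_chk_fail, which calls panic to halt the kernel. There are already articles covering the kernel canary, so I will not go into it here.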

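III. Digging into glibc

The version referenced here is glibc-2.23. It is fairly old, but the principles have not changed.

Let's go through it step by step.

1. Where the Canary Comes From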

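The assignment of the canary can be found in the __libc_start_main function in csu\libc-start.c (in 2.23 this snippet is compiled only for statically linked builds; for dynamically linked programs the dynamic linker runs the same sequence in security_init in elf\rtld.c):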

  /* Set up the stack checker's canary.  */
  uintptr_t stack_chk_guard = _dl_setup_stack_chk_guard (_dl_random);
# ifdef THREAD_SET_STACK_GUARD
  THREAD_SET_STACK_GUARD (stack_chk_guard);
# else
  __stack_chk_guard = stack_chk_guard;
# endif

Here _dl_random is a pointer to random data supplied by the kernel:

/* Random data provided by the kernel.  */
void *_dl_random;

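As for exactly when this random data becomes available: the kernel generates it before the dynamic linker even starts running, and the dynamic linker picks it up at a very early point. The call chain is as follows:

elf\rtld.c: the RTLD_START macro, the main entry point of the dynamic linker.

sysdeps\x86_64\dl-machine.h: the architecture-specific asm definition of RTLD_START. Part of the dynamic linker has to be written in assembly, so each architecture provides its own version. From the comments and code we can see that _start first calls _dl_start, then _dl_start_user performs some initialization, and finally control jumps to the ELF entry address of the user program: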

/* Initial entry point code for the dynamic linker.
  The C function `_dl_start' is the real entry point;
  its return value is the user program's entry point.  */
#define RTLD_START asm ("\n\
.text\n\
  .align 16\n\
.globl _start\n\
.globl _dl_start_user\n\
_start:\n\
  movq %rsp, %rdi\n\
  call _dl_start\n\
_dl_start_user:\n\

  ...

  # And make sure %rsp points to argc stored on the stack.\n\
  movq %r13, %rsp\n\
  # Jump to the user's entry point.\n\
  jmp *%r12\n\
.previous\n\
");

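elf\rtld.c: _dl_start -> _dl_start_final -> _dl_sysdep_start. The _dl_sysdep_start function calls some platform-dependent routines for initialization and then calls dl_main to obtain the entry address of the user program. What matters here, though, is not those operations but this for loop: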

ElfW(Addr)
_dl_sysdep_start (void **start_argptr,
     void (*dl_main) (const ElfW(Phdr) *phdr, ElfW(Word) phnum,
          ElfW(Addr) *user_entry, ElfW(auxv_t) *auxv))
{
  ...
  DL_FIND_ARG_COMPONENTS (start_argptr, _dl_argc, _dl_argv, _environ,
         GLRO(dl_auxv));
  for (av = GLRO(dl_auxv); av->a_type != AT_NULL; set_seen (av++))
    ...
   case AT_RANDOM:
   _dl_random = (void *) av->a_un.a_val;
   break;
    ...
  ...
}

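start_argptr points to the argc, argv, env and auxv data passed to the dynamic linker, and the DL_FIND_ARG_COMPONENTS macro sorts them into the corresponding variables _dl_argc, _dl_argv, _environ and _dl_auxv. In other words, besides the three arguments we are most familiar with, the dynamic linker receives one more: auxv.

This extra argument, the auxiliary vector, is an array of entries that the kernel passes along to assist program execution, and it carries a lot of useful information. Here we only care about AT_RANDOM, the random data from the kernel. This loop is where its address is assigned to _dl_random, which is later used to generate the canary.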

Back in __libc_start_main: once the random data is available, the canary is actually generated by the following logic:

// sysdeps\unix\sysv\linux\dl-osinfo.h
static inline uintptr_t __attribute__ ((always_inline))
_dl_setup_stack_chk_guard (void *dl_random)
{
  union
  {
    uintptr_t num;
    unsigned char bytes[sizeof (uintptr_t)];
  } ret;

  /* We need in the moment only 8 bytes on 32-bit platforms and 16
     bytes on 64-bit platforms.  Therefore we can use the data
     directly and not use the kernel-provided data to seed a PRNG.  */
  memcpy (ret.bytes, dl_random, sizeof (ret));
#if BYTE_ORDER == LITTLE_ENDIAN
  ret.num &= ~(uintptr_t) 0xff;
#elif BYTE_ORDER == BIG_ENDIAN
  ret.num &= ~((uintptr_t) 0xff << (8 * (sizeof (ret) - 1)));
#else
# error "BYTE_ORDER unknown"
#endif
  return ret.num;
}

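As we can see, the canary is simply the first sizeof (uintptr_t) bytes of the data at dl_random, except that the byte at the lowest address is forced to \x00 to hinder leaks, which matches the observation made earlier.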

2. Where the Canary Is Stored

Again we start from the __libc_start_main function:

  /* Set up the stack checker's canary.  */
  uintptr_t stack_chk_guard = _dl_setup_stack_chk_guard (_dl_random);
# ifdef THREAD_SET_STACK_GUARD
  THREAD_SET_STACK_GUARD (stack_chk_guard);
# else
  __stack_chk_guard = stack_chk_guard;
# endif

If the THREAD_SET_STACK_GUARD macro is defined, meaning the guard is kept per thread, the canary value is written into thread-local storage:

// sysdeps\x86_64\nptl\tls.h
/* Set the stack guard field in TCB head.  */
# define THREAD_SET_STACK_GUARD(value) \
    THREAD_SETMEM (THREAD_SELF, header.stack_guard, value)

Here THREAD_SELF refers to the thread control block (TCB) of the current thread:

// sysdeps\x86_64\nptl\tls.h
/* Return the thread descriptor for the current thread.

   The contained asm must *not* be marked volatile since otherwise
   assignments like
  pthread_descr self = thread_self();
   do not get optimized away.  */
# define THREAD_SELF \
  ({ struct pthread *__self;                  \
     asm ("mov %%fs:%c1,%0" : "=r" (__self)           \
    : "i" (offsetof (struct pthread, header.self)));        \
     __self;})

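The pthread structure is declared as follows. According to the comment, struct pthread is the thread descriptor, that is, the thread control block structure: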

/* Thread descriptor data structure.  */
struct pthread
{
  union
  {
#if !TLS_DTV_AT_TP
    /* This overlaps the TCB as used for TLS without threads (see tls.h).  */
    tcbhead_t header;
#else
    struct
    {
      ...
    } header;
#endif

    /* This extra padding has no special purpose, and this structure layout
       is private and subject to change without affecting the official ABI.
       We just have it here in case it might be convenient for some
       implementation-specific instrumentation hack or suchlike.  */
    void *__padding[24];
  };

  ...
}

On x86_64, the TLS_DTV_AT_TP macro is defined as 0:

// sysdeps\x86_64\nptl\tls.h

/* The TCB can have any size and the memory following the address the
   thread pointer points to is unspecified.  Allocate the TCB there.  */
# define TLS_TCB_AT_TP  1
# define TLS_DTV_AT_TP  0

so the first field of struct pthread is tcbhead_t header:

// sysdeps\x86_64\nptl\tls.h

typedef struct
{
  void *tcb;    /* Pointer to the TCB.  Not necessarily the
         thread descriptor used by libpthread.  */
  dtv_t *dtv;
  void *self;   /* Pointer to the thread descriptor.  */
  int multiple_threads;
  int gscope_flag;
  uintptr_t sysinfo;
  uintptr_t stack_guard;
  uintptr_t pointer_guard;
  
  ... 
} tcbhead_t;

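In tcbhead_t we find the familiar stack_guard field, which is where each thread's canary is stored. The tcb and self pointers actually point to the same address, namely the struct pthread (or equivalently the tcbhead_t itself, since the two share the same address).

Looking back at the THREAD_SELF macro, it is not hard to infer that the %fs base holds the address of the struct pthread, and that %fs:28h refers to the stack_guard field of its tcbhead_t header, consistent with what IDA showed earlier.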

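At first I did not see why obtaining the struct pthread address takes such a detour through the header's self pointer. A likely reason is that ordinary user-space code can only address memory relative to %fs and has no cheap way to read the segment base itself, so the base address is also stored in memory at a known offset.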

Next, why does %fs hold the address of the struct pthread? Take a look at this macro definition:

/* Code to initially initialize the thread pointer.  This might need
   special attention since 'errno' is not yet available and if the
   operation can cause a failure 'errno' must not be touched.

   We have to make the syscall for both uses of the macro since the
   address might be (and probably is) different.  */
# define TLS_INIT_TP(thrdescr) \
  ({ void *_thrdescr = (thrdescr);                \
     tcbhead_t *_head = _thrdescr;               \
     int _result;                 \
                        \
     _head->tcb = _thrdescr;                   \
     /* For now the thread descriptor is at the same address.  */       \
     _head->self = _thrdescr;                  \
                        \
     /* It is a simple syscall to set the %fs value for the thread.  */       \
     asm volatile ("syscall"                  \
       : "=a" (_result)               \
       : "0" ((unsigned long int) __NR_arch_prctl),           \
         "D" ((unsigned long int) ARCH_SET_FS),         \
         "S" (_thrdescr)                \
       : "memory", "cc", "r11", "cx");             \
                        \
    _result ? "cannot set %fs base address for thread-local storage" : 0;     \
  })

# define TLS_DEFINE_INIT_TP(tp, pd) void *tp = (pd)

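The TLS_INIT_TP macro invokes the arch_prctl system call with ARCH_SET_FS to set the %fs base to the address of the given pthread structure. We can also see that it stores the address of the thread control block into both the tcb and self pointer fields.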

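So when is TLS_INIT_TP invoked to install the main thread's TCB into %fs? There are two cases:

During dl_main, when some condition requires TLS to be usable early, so it is initialized ahead of time.
During __libc_start_main, through the call chain __pthread_initialize_minimal -> __libc_setup_tls.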

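In either case, this happens before the canary is created. The second case in particular sits almost immediately before the canary setup. That ties the whole sequence together:

Before running dl_main, the dynamic linker initializes _dl_random.
Before the canary is created, the TLS_INIT_TP macro sets the %fs base to the address of the main thread's thread control block.
In __libc_start_main, _dl_random is used to generate the canary, which is then stored in the stack_guard field of the thread control block that %fs points to.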

3. Reading the Canary

We now know how the canary is written into the main thread's TLS, but how is it read back? sysdeps\x86_64\stackguard-macros.h contains this macro definition:

#define STACK_CHK_GUARD \
  ({ uintptr_t x;           \
     asm ("mov %%fs:%c1, %0" : "=r" (x)     \
    : "i" (offsetof (tcbhead_t, stack_guard))); x; })

So the STACK_CHK_GUARD macro is all that is needed to read the current thread's canary, for example:

if (stack_chk_guard_copy != STACK_CHK_GUARD)
{
    puts ("STACK_CHK_GUARD changed between constructor and do_test");
    return 1;
}

If THREAD_SET_STACK_GUARD is not defined, that is, there is no per-thread guard, the computed canary is stored in the global variable __stack_chk_guard instead:

// Excerpt from __libc_start_main

  /* Set up the stack checker's canary.  */
  uintptr_t stack_chk_guard = _dl_setup_stack_chk_guard (_dl_random);
# ifdef THREAD_SET_STACK_GUARD
  THREAD_SET_STACK_GUARD (stack_chk_guard);
# else
  // Here!
  __stack_chk_guard = stack_chk_guard;
# endif

It can still be read through the STACK_CHK_GUARD macro:

// sysdeps\generic\stackguard-macros.h
    
extern uintptr_t __stack_chk_guard;
#define STACK_CHK_GUARD __stack_chk_guard

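The STACK_CHK_GUARD macro is hardly used anywhere in glibc itself, apart from test code like the snippet above. Compiled programs do not go through it either: gcc emits the equivalent access (%fs:28h on x86_64) directly when it inserts the canary check.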

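4. Location of the TCB

a. Main thread

Memory allocation for the main thread's TCB is fairly involved and can follow two paths:

One is the call chain __libc_start_main -> __pthread_initialize_minimal -> __libc_setup_tls, which calls __sbrk to allocate the TLS on the heap.
The other is _dl_allocate_tls_storage in rtld, which allocates the TLS through mmap.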

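In practice, most programs seem to get their TCB allocated early in rtld rather than after reaching the user entry. I wrote a quick test program and debugged it, and the main thread's TLS was indeed created through mmap.

gdb cannot show the %fs base through the plain register: printing $fs gives only the segment selector, which is 0 (newer gdb versions expose the base separately as $fs_base).

So instead I had gdb call pthread_self to obtain the current thread's TCB address. This function is quite simple: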

pthread_t
__pthread_self (void)
{
  return (pthread_t) THREAD_SELF;
}

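Here we can see that the canary the user program loads from %fs:28h matches the one stored in the main thread's TCB, confirming the analysis above.

Conclusion: the location of the main thread's TLS is fairly random, so changing the main thread's canary by modifying its TLS is practically infeasible.

b. Child threads

To see how a child thread's TCB and canary are handled, we need to look at the implementation of pthread_create. It lives in nptl\pthread_create.c and comes in two versions, __pthread_create_2_0 and __pthread_create_2_1. Since 2.0 is just a wrapper around 2.1, we focus on the 2.1 implementation.

Only the interesting fragments are shown here: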

  struct pthread *pd = NULL;
  int err = ALLOCATE_STACK (iattr, &pd);

  [...]

  /* Initialize the TCB.  All initializations with zero should be
   performed in 'get_cached_stack'.  This way we avoid doing this if
   the stack freshly allocated with 'mmap'.  */

#if TLS_TCB_AT_TP
  /* Reference to the TCB itself.  */
  pd->header.self = pd;

  /* Self-reference for TLS.  */
  pd->header.tcb = pd;
#endif

  [...]
      
  /* Copy the stack guard canary.  */
#ifdef THREAD_COPY_STACK_GUARD
  THREAD_COPY_STACK_GUARD (pd);
#endif

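First, pthread_create sets up the thread stack (every thread has its own). The stack may come from a cache (for example, reusing the stack of a terminated thread) or from a fresh mmap. Interestingly, the new thread's TCB is created inside this same thread stack mapping. As a result, the child thread's TCB sits at a predictable offset from the buffers on that thread's stack, so a stack overflow in the child thread can overwrite the canary stored in its own TCB.

Note that in allocate_stack, the function that allocates the child thread's stack, the TCB (the pthread structure) is placed at the very bottom of the thread stack, that is, at its highest addresses.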

This can be verified. I copied a pthread example from the internet, modified it slightly, then compiled and debugged it:

#include <pthread.h>
#include <stdio.h>
// a simple pthread example
// compile with -pthread (add -fstack-protector if it is not on by default)

// the function to be executed as a thread
void *thread(void *ptr)
{
    // a local char array makes the compiler emit the stack canary check
    char ch[0x20];
    // deliberately unbounded read: this is the overflow under study
    scanf("%s", ch);
    printf("%s", ch);
    return NULL;
}

int main(int argc, char **argv)
{
    // create the thread object
    pthread_t thread1;
    // start the thread
    pthread_create(&thread1, NULL, thread, NULL);
    // wait for the thread to finish
    pthread_join(thread1, NULL);
    return 0;
}

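Set a breakpoint on the thread function, run, and switch to the child thread. Comparing the thread stack and the TCB address at this point shows that they are very close and lie in the same memory mapping.

The canary can then be found near the bottom of the thread stack, at an offset of 0x878 (admittedly a bit far).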

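Besides the thread stack allocation, there is also the THREAD_COPY_STACK_GUARD macro call further down, which copies the current thread's canary into the new thread's TCB. Keep in mind that the unit of control flow is the thread: although every thread starts with the same canary value, verification only reads the copy stored in the current thread's own TCB. So if a child thread's canary is changed by illegitimate means, the change does not affect the execution of other threads.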

That is roughly everything about the user-space canary mechanism. It is quite an interesting design.
