linux-kernel-syscall-inside

Appetizer

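When we write applications in user space and pick a piece of code, or more specifically a function, and then dig into it recursively, the call chain eventually bottoms out. The last thing we reach is always a piece of assembly or a syscall, that is, a system call. Normally this is where we stop, so the operating system always looks like a black box to me: I do not know what it does, but as long as I follow its rules I get what I want.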

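For application developers this is actually a blessing. The biggest difficulty I had when reading the Linux source was similar: in user space I stopped as soon as I hit assembly, but Linux is full of macros, its call chains run deeper, and there are separate branches for different processors. Had I kept the user-space habit of following every call to the bottom, my reading would have been extremely slow and inefficient.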

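Later I bought a book on Linux 4 for amd64 and found the answer there: at the beginning, the most important thing is to understand the data structures and how they relate, and the call relationships between functions, without digging too far into how each function is implemented. If something interests you, go deeper then.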

What does syscall actually do?

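There are many different system calls, such as open, write, read, exit and so on. Think of the operating system as the first process to start, one that can operate the hardware directly; or think of it as a huge virtual machine on top of which our applications run. Applications also need I/O, memory, and network operations, but the operating system completely isolates the hardware from them, so it has to offer applications an interface for these operations.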

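That interface, however, is not a set of user-mode functions; it consists of kernel-mode procedures (from here on I use "kernel" to mean the operating system). When user-mode code needs to call a kernel interface, it has to tell the kernel that it wants some operation performed. The processor then switches the current user process into kernel mode, and once in kernel mode, the kernel procedure behind the system call can run.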
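To make the "tell the kernel what I want" step concrete, here is a minimal user-space sketch that requests a service by system call number through the libc `syscall()` wrapper rather than through the usual `getpid()`/`write()` functions. The helper names `pid_via_syscall` and `write_via_syscall` are made up for illustration, and the sketch assumes Linux with glibc or a compatible libc.

```c
#define _GNU_SOURCE
#include <sys/syscall.h>   /* SYS_getpid, SYS_write */
#include <unistd.h>        /* syscall(), getpid() */

/* Hypothetical helper: ask the kernel for our PID by system call number. */
long pid_via_syscall(void)
{
    return syscall(SYS_getpid);
}

/* Hypothetical helper: write a buffer to a file descriptor by number.
 * Like write(), it returns -1 and sets errno on failure. */
long write_via_syscall(int fd, const void *buf, unsigned long len)
{
    return syscall(SYS_write, fd, buf, len);
}
```

Calling `pid_via_syscall()` should give the same value as `getpid()`; the only difference is that the caller names the kernel procedure by number.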

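So a syscall is a transition between user mode and kernel mode: execution moves from ring 3 to ring 0 to run some procedure. Besides system calls, what else switches to kernel mode? Errors do, for example a division by zero or a read from an invalid address. All of these can be grouped under two kinds of handlers: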

exception handler
interrupt handler

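The system call path can be grouped with the interrupt handlers. Two things still need sorting out:
- A user application that makes a system call has to pass arguments. How do they reach the kernel?
- How does the processor switch from user mode to kernel mode?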

Start with the first question. The answer is in the syscall entry code of each arch branch, for example arch/x86/entry/entry_64.S for x86_64:

* Registers on entry:
* rax  system call number
* rcx  return address
* r11  saved rflags (note: r11 is callee-clobbered register in C ABI)
* rdi  arg0
* rsi  arg1
* rdx  arg2
* r10  arg3 (needs to be moved to rcx to conform to C ABI)
* r8   arg4
* r9   arg5
* (note: r12-r15, rbp, rbx are callee-preserved in C ABI)

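Each register carries a specific value. A few are worth calling out:
- rax holds the system call number.
- rcx holds the user-mode return address. Why rcx? Because the SYSCALL instruction itself stores the address of the next instruction in rcx, and the return instruction sysretq loads RIP back from rcx.
- Since rcx is taken, and rcx is the register that carries the fourth argument of a function call in the C ABI, arg3 is passed in r10 instead. The Linux kernel is written in C, so to conform to the C ABI the value in r10 has to be moved into rcx before the C handler is called.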
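The register convention above can be exercised from user space with a few lines of inline assembly. This is only a sketch: `raw_syscall3` is a hypothetical helper, it assumes GCC or Clang on x86-64 Linux, and on any other target it falls back to the libc `syscall()` wrapper so that it still compiles.

```c
#define _GNU_SOURCE
#include <sys/syscall.h>   /* SYS_getpid, SYS_write */
#include <unistd.h>        /* syscall(), getpid() */

/* Hypothetical helper: issue a system call with up to three arguments. */
long raw_syscall3(long nr, long a0, long a1, long a2)
{
#if defined(__linux__) && defined(__x86_64__)
    long ret;
    __asm__ volatile (
        "syscall"
        : "=a"(ret)               /* rax: return value on exit       */
        : "a"(nr),                /* rax: system call number         */
          "D"(a0),                /* rdi: arg0                       */
          "S"(a1),                /* rsi: arg1                       */
          "d"(a2)                 /* rdx: arg2                       */
        : "rcx", "r11", "memory"  /* SYSCALL overwrites rcx and r11  */
    );
    return ret;                   /* raw result: -errno on failure   */
#else
    /* Not x86-64 Linux: let libc pick the right convention
     * (returns -1 and sets errno on failure). */
    return syscall(nr, a0, a1, a2);
#endif
}
```

The clobber list is the part that mirrors the entry comment: rcx and r11 must be declared as destroyed because the instruction uses them for the return address and the saved rflags.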

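The rest needs no explanation; the comments are detailed enough. Now for the second question: how does the processor switch to ring 0? Look directly at what the syscall instruction does: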

SYSCALL invokes an OS system-call handler at privilege level 0. It does so by loading RIP from the IA32_LSTAR MSR (after saving the address of the instruction following SYSCALL into RCX). (The WRMSR instruction ensures that the IA32_LSTAR MSR always contain a canonical address.)

SYSCALL also saves RFLAGS into R11 and then masks RFLAGS using the IA32_FMASK MSR (MSR address C0000084H); specifically, the processor clears in RFLAGS every bit corresponding to a bit that is set in the IA32_FMASK MSR.

SYSCALL loads the CS and SS selectors with values derived from bits 47:32 of the IA32_STAR MSR. However, the CS and SS descriptor caches are not loaded from the descriptors (in GDT or LDT) referenced by those selectors. Instead, the descriptor caches are loaded with fixed values. See the Operation section for details. It is the responsibility of OS software to ensure that the descriptors (in GDT or LDT) referenced by those selector values correspond to the fixed values loaded into the descriptor caches; the SYSCALL instruction does not ensure this correspondence.

In condensed form:

RCX ← RIP;
RIP ← IA32_LSTAR;
R11 ← RFLAGS;
RFLAGS ← RFLAGS AND NOT(IA32_FMASK);
CS.Selector ← IA32_STAR[47:32] AND FFFCH
SS.Selector ← IA32_STAR[47:32] + 8;

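IA32_LSTAR and IA32_STAR are both MSRs (model-specific registers); they hold the system call entry point and the kernel-mode CS and SS selectors, respectively. Note that this step does not switch stacks, so the switch to the kernel stack has to happen inside the system call entry point.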
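The condensed pseudocode can be modelled in plain C to see which pieces of state change and which do not. The sketch below is a toy model, not kernel or CPU code: `struct cpu_model` and `model_syscall` are invented names, and only the registers mentioned in the pseudocode are represented.

```c
#include <stdint.h>

/* Hypothetical model of the state that SYSCALL reads and writes. */
struct cpu_model {
    uint64_t rip, rsp, rcx, r11, rflags;
    uint16_t cs, ss;
    uint64_t ia32_lstar;   /* MSR: system call entry point            */
    uint64_t ia32_star;    /* MSR: bits 47:32 give the kernel selector */
    uint64_t ia32_fmask;   /* MSR: rflags bits to clear on entry       */
};

/* Apply the pseudocode; next_rip is the address after SYSCALL. */
void model_syscall(struct cpu_model *c, uint64_t next_rip)
{
    uint16_t sel = (uint16_t)((c->ia32_star >> 32) & 0xFFFF);

    c->rcx    = next_rip;            /* RCX <- RIP (return address)   */
    c->rip    = c->ia32_lstar;       /* RIP <- IA32_LSTAR             */
    c->r11    = c->rflags;           /* R11 <- RFLAGS                 */
    c->rflags &= ~c->ia32_fmask;     /* clear the masked flag bits    */
    c->cs     = sel & 0xFFFC;        /* CS.Selector, RPL forced to 0  */
    c->ss     = (uint16_t)(sel + 8); /* SS.Selector                   */
    /* rsp is deliberately left alone: SYSCALL does not switch stacks. */
}
```

The untouched `rsp` field is the point of the model: after the instruction the CPU runs kernel code while still holding the user stack pointer.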

System call entry point

The Linux kernel's system call entry point is written in assembly. For x86_64 it looks like this:

ENTRY(entry_SYSCALL_64)
	swapgs
	movq %rsp, PER_CPU_VAR(rsp_scratch)
	movq PER_CPU_VAR(cpu_current_top_of_stack), %rsp

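This is the first step. As noted above, syscall does not switch stacks, so the first thing entry_SYSCALL_64 does is switch stacks. swapgs exchanges GS.base with IA32_KERNEL_GS_BASE, which makes GS the kernel GS. The kernel GS base is the address of the per-CPU area, which holds the data tied to each processor core, including the kernel stack address needed here. The two movq instructions that follow save the user rsp and load the kernel stack pointer, which completes the stack switch.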
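The three instructions can be mirrored in a small C model to show the data flow. Everything here is illustrative: `struct percpu_model`, `struct entry_model` and `model_entry_stack_switch` are invented, and the two per-CPU fields simply reuse the variable names from the assembly above.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the per-CPU area reached through GS. */
struct percpu_model {
    uint64_t rsp_scratch;              /* temporary home for user rsp */
    uint64_t cpu_current_top_of_stack; /* top of this CPU's kernel stack */
};

/* Hypothetical CPU state relevant to the first three instructions. */
struct entry_model {
    uint64_t rsp;
    struct percpu_model *gs_base;        /* active GS base            */
    struct percpu_model *kernel_gs_base; /* IA32_KERNEL_GS_BASE       */
};

void model_entry_stack_switch(struct entry_model *m)
{
    /* swapgs: exchange the active GS base with IA32_KERNEL_GS_BASE */
    struct percpu_model *tmp = m->gs_base;
    m->gs_base = m->kernel_gs_base;
    m->kernel_gs_base = tmp;

    /* movq %rsp, PER_CPU_VAR(rsp_scratch) */
    m->gs_base->rsp_scratch = m->rsp;

    /* movq PER_CPU_VAR(cpu_current_top_of_stack), %rsp */
    m->rsp = m->gs_base->cpu_current_top_of_stack;
}
```

The order matters in the model just as in the assembly: the per-CPU area is only reachable after the swap, and the user stack pointer has to be parked before rsp is overwritten.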

Next, the user-mode register state at this point is saved:

pushq $__USER_DS /* pt_regs->ss */
pushq PER_CPU_VAR(rsp_scratch) /* pt_regs->sp */
pushq %r11 /* pt_regs->flags */
pushq $__USER_CS /* pt_regs->cs */
pushq %rcx /* pt_regs->ip */
pushq %rax /* pt_regs->orig_ax */
pushq %rdi /* pt_regs->di */
pushq %rsi /* pt_regs->si */
pushq %rdx /* pt_regs->dx */
pushq %rcx /* pt_regs->cx */
pushq $-ENOSYS /* pt_regs->ax */
pushq %r8 /* pt_regs->r8 */
pushq %r9 /* pt_regs->r9 */
pushq %r10 /* pt_regs->r10 */
pushq %r11 /* pt_regs->r11 */
sub $(6*8), %rsp /* pt_regs->bp, bx, r12-15 not saved */
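Read from the last push back to the first, the sequence above lines up with the field order of a C structure, because the stack grows downward and the lowest address becomes the first field. The following user-space mirror sketches that layout; `struct pt_regs_mirror` is an illustrative copy written for this post, with field names taken from the comments in the assembly, and should not be taken as the kernel's own definition.

```c
#include <stddef.h>
#include <stdint.h>

/* Illustrative mirror of the register frame built by the pushes. */
struct pt_regs_mirror {
    /* Space reserved by "sub $(6*8), %rsp"; not filled in here. */
    uint64_t r15, r14, r13, r12, bp, bx;
    /* Pushed last, so they sit at the lowest filled addresses. */
    uint64_t r11, r10, r9, r8;
    uint64_t ax;        /* preset to -ENOSYS */
    uint64_t cx, dx, si, di;
    uint64_t orig_ax;   /* system call number from rax */
    /* Pushed first: the frame needed to return to user mode. */
    uint64_t ip, cs, flags, sp, ss;
};

/* 15 pushes plus 6 reserved slots, 8 bytes each. */
_Static_assert(sizeof(struct pt_regs_mirror) == 21 * 8,
               "frame should be 21 eight-byte slots");
```

With this layout a C function that receives a pointer to the frame can read the arguments as `regs->di`, `regs->si` and so on, which is why the assembly bothers to push in exactly this order.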

These pushes build a struct pt_regs on the stack. Next, the system call number passed in rax is used to find the corresponding handler:

call *sys_call_table(, %rax, 8)

sys_call_table is effectively a table of system calls:

asmlinkage const sys_call_ptr_t sys_call_table[__NR_syscall_max+1] = {
    [0 ... __NR_syscall_max] = &sys_ni_syscall,
    #include <asm/syscalls_64.h>
};

syscalls_64.h is generated during the build and looks like this:

#define __SYSCALL_COMMON(nr, sym, compat) __SYSCALL_64(nr, sym, compat)
#define __SYSCALL_64(nr, sym, compat) [nr] = sym,

__SYSCALL_COMMON(0, sys_read, sys_read)
__SYSCALL_COMMON(1, sys_write, sys_write)
__SYSCALL_COMMON(2, sys_open, sys_open)
__SYSCALL_COMMON(3, sys_close, sys_close)
__SYSCALL_COMMON(5, sys_newfstat, sys_newfstat)

So sys_call_table ends up as:

asmlinkage const sys_call_ptr_t sys_call_table[__NR_syscall_max+1] = {
    [0 ... __NR_syscall_max] = &sys_ni_syscall,
    [0] = sys_read,
    [1] = sys_write,
    [2] = sys_open,
    ...
    ...
    ...
};
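The same table-building trick works in ordinary user-space C, which makes it easy to experiment with. The sketch below is a toy: every `demo_` name is invented, and it relies on the GNU range designator `[a ... b]` that the kernel table also uses, so it assumes GCC or Clang. Later initializers override the default, exactly as in the expanded table above.

```c
#include <errno.h>

#define DEMO_NR_MAX 7   /* hypothetical highest call number */

typedef long (*demo_call_ptr_t)(long, long, long);

/* Default entry, playing the role of sys_ni_syscall. */
long demo_ni_syscall(long a, long b, long c)
{
    (void)a; (void)b; (void)c;
    return -ENOSYS;
}

long demo_add(long a, long b, long c)   { return a + b + c; }
long demo_first(long a, long b, long c) { (void)b; (void)c; return a; }

const demo_call_ptr_t demo_call_table[DEMO_NR_MAX + 1] = {
    [0 ... DEMO_NR_MAX] = &demo_ni_syscall, /* fill every slot first */
    [0] = demo_add,                         /* then override by number */
    [1] = demo_first,
};

/* C equivalent of "call *table(, %rax, 8)" with a bounds check added. */
long demo_dispatch(unsigned long nr, long a, long b, long c)
{
    if (nr > DEMO_NR_MAX)
        return -ENOSYS;
    return demo_call_table[nr](a, b, c);
}
```

Filling the whole range first means an unassigned number reaches the "not implemented" stub rather than a null pointer.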

Where do the definitions of sys_read, sys_write and the like come from?

#define SYSCALL_DEFINE3(name, ...) SYSCALL_DEFINEx(3, _##name, __VA_ARGS__)

#define SYSCALL_DEFINEx(x, sname, ...)                \
        SYSCALL_METADATA(sname, x, __VA_ARGS__)       \
        __SYSCALL_DEFINEx(x, sname, __VA_ARGS__)
        
SYSCALL_DEFINE3(write, unsigned int, fd, const char __user *, buf,
        size_t, count)
{
    ...
}

The net effect is:

asmlinkage long sys_write(unsigned int fd, const char __user * buf, size_t count);
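The expansion is easier to follow with a stripped-down imitation. The macro below is a simplification written for this post, not the kernel's macro: `DEMO_SYSCALL_DEFINE3` and `demo_sys_sum3` are invented names, and it leaves out the metadata and the extra wrapper layers that the real SYSCALL_DEFINEx adds. It only shows how a name and a flat list of type/argument pairs turn into a function header.

```c
/* Simplified imitation: paste "demo_sys_" onto the name and pair up
 * each type with its argument name to form the parameter list. */
#define DEMO_SYSCALL_DEFINE3(name, t1, a1, t2, a2, t3, a3) \
    long demo_sys_##name(t1 a1, t2 a2, t3 a3)

/* Expands to: long demo_sys_sum3(int x, int y, int z) { ... } */
DEMO_SYSCALL_DEFINE3(sum3, int, x, int, y, int, z)
{
    return (long)x + y + z;
}
```

Because the macro produces only the header, the braces that follow it become the function body, which is why the kernel's definitions look like a macro call followed by a block.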

need to know

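CVE-2009-0029 is worth a look.
#define __NR_syscall_max is not a fixed value; it is generated at build time.
The x32 compatibility handling in the system call path deserves attention.
The debug and trace paths in the syscall entry are worth digging into; trace, for instance, may be where seccomp is implemented.
Besides sysret, iret can also be used to return, and the two return paths are handled somewhat differently!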

maplgebra
