Online judge sandbox design notes (1)
This article is deprecated; please see here for more details.
There are two main approaches to implementing a sandbox for an online judge (OJ): ptrace and seccomp. We discuss each in turn.
ptrace()
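ptrace is a facility on Unix-like systems for observing and controlling the internal state of another process. It has many uses: debugging (gdb, dbx, and so on), code coverage analysis, and even runtime patching (IDA and OllyDbg may come to mind). Here we look at how restricting system calls lets us run a program as if it were in a sandbox.
We start with a program that deletes the /etc/hosts file: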
    [root@localhost ~]# cat a.c
    #include <unistd.h>
    int main()
    {
        unlink("/etc/hosts");
        return 0;
    }
    [root@localhost ~]# gcc a.c
    [root@localhost ~]# ll -al /etc/hosts
    -rw-r--r--. 1 root root 158 Jun 7 2013 /etc/hosts
    [root@localhost ~]# ./a.out
    [root@localhost ~]# ll -al /etc/hosts
    ls: cannot access /etc/hosts: No such file or directory
a.out did its job and deleted the hosts file.
Now we need to design a program that uses ptrace() to stop unlink() from deleting files in the file system.
First, let's see how the manual describes ptrace():
    #include <sys/ptrace.h>
    long ptrace(enum __ptrace_request request, pid_t pid, void *addr, void *data);
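The request argument determines what ptrace() does. Here are the requests we will need:
- PTRACE_TRACEME: indicates that this process (the tracee) is to be traced by its parent (the tracer). Note: this is the only request issued by the tracee itself; all the others are issued by the tracer;
- PTRACE_PEEKUSER: reads a word from the tracee's USER area, which holds its general-purpose registers. The orig_rax / orig_eax slot (depending on the platform) holds the number of the most recent system call. The code below uses PTRACE_GETREGS instead, which copies all general-purpose registers in one call;
- PTRACE_SYSCALL: resumes the stopped tracee and arranges for it to stop again at the next entry to or exit from a system call;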
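Since the tracer in this article reads registers with PTRACE_GETREGS, here is a minimal sketch of the PTRACE_PEEKUSER route described in the list. It assumes x86-64 Linux with glibc, where struct user begins with the saved registers; orig_rax_offset() and peek_syscall_number() are hypothetical helper names, not part of any library.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/user.h>

/* Byte offset of the orig_rax slot inside the tracee's USER area.
 * Assumption: x86-64 layout from <sys/user.h>. */
long orig_rax_offset(void)
{
    return (long)(offsetof(struct user, regs) +
                  offsetof(struct user_regs_struct, orig_rax));
}

/* Hypothetical helper: read the system call number of a stopped tracee.
 * PTRACE_PEEKUSER returns the word itself, so -1 is ambiguous; the caller
 * should check errno to tell an error from a stored value of -1. */
long peek_syscall_number(pid_t tracee)
{
    errno = 0;
    return ptrace(PTRACE_PEEKUSER, tracee, (void *)orig_rax_offset(), NULL);
}
```

A tracer would call peek_syscall_number(child) after waitpid() reports that the child has stopped, in the same place where a PTRACE_GETREGS call would otherwise go.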
This gives us a plan: every time the program makes a system call, we check whether it is unlink(). If it is, we kill the child and exit:
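    #include <sys/ptrace.h>
    #include <sys/types.h>
    #include <sys/wait.h>
    #include <sys/user.h>
    #include <syscall.h>
    #include <signal.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    /* The saved system call number lives in orig_rax (x86-64) or orig_eax (i386). */
    #if __WORDSIZE == 64
    #define REG(reg) reg.orig_rax
    #else
    #define REG(reg) reg.orig_eax
    #endif

    int main(int argc, char* argv[])
    {
        pid_t child;

        child = fork();
        if (child == 0) {
            /* Child: ask to be traced by the parent, then run the target. */
            ptrace(PTRACE_TRACEME, 0, NULL, NULL);
            execl("/root/a.out", "a.out", NULL);
            _exit(127); /* reached only if execl() fails */
        } else {
            int status;

            /* Each return from waitpid() is a stop of the child; leave once it is gone. */
            while (waitpid(child, &status, 0) > 0 &&
                   !WIFEXITED(status) && !WIFSIGNALED(status)) {
                struct user_regs_struct regs;

                ptrace(PTRACE_GETREGS, child, NULL, &regs);
                if (REG(regs) == SYS_unlink) { /* 87 on x86-64 */
                    fprintf(stderr, "error: call syscall unlink() is not allowed!\n");
                    kill(child, SIGKILL);
                    return 0;
                }
                fprintf(stdout, "Process executed systemcallID: %ld\n", (long)REG(regs));
                /* Resume the child until its next system call entry or exit. */
                ptrace(PTRACE_SYSCALL, child, NULL, NULL);
            }
        }
        exit(0);
    }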
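When a system call occurs, the kernel saves the contents of %eax/%rax (the system call number) in the orig_eax/orig_rax slot of the tracee's saved registers. We read that value through ptrace() (PTRACE_GETREGS in the code above; PTRACE_PEEKUSER can read the same slot), and then call ptrace() with PTRACE_SYSCALL to let the child run again. As soon as the child is seen making system call 87 (sys_unlink on x86-64), the tracer kills it and exits.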
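The loop above treats every stop of the child as a system call stop and sees each call at both entry and exit. A tracer can tell these cases apart; the following is a minimal sketch, assuming the tracer has first enabled the PTRACE_O_TRACESYSGOOD option with PTRACE_SETOPTIONS, which makes the kernel report system call stops as SIGTRAP | 0x80. The helper names is_syscall_stop() and note_syscall_stop() are hypothetical.

```c
#include <signal.h>
#include <sys/wait.h>

/* Returns 1 if a waitpid() status describes a system call stop.
 * Assumption: PTRACE_O_TRACESYSGOOD is set, so such stops carry
 * SIGTRAP | 0x80 instead of a plain SIGTRAP. */
int is_syscall_stop(int status)
{
    return WIFSTOPPED(status) && WSTOPSIG(status) == (SIGTRAP | 0x80);
}

/* System call stops alternate between entry and exit. Flip the caller's
 * flag at each stop; returns 1 on entry and 0 on exit. The flag must
 * start at 0. */
int note_syscall_stop(int *in_syscall)
{
    *in_syscall = !*in_syscall;
    return *in_syscall;
}
```

With these two helpers, the tracer would inspect the system call number only when is_syscall_stop(status) holds and note_syscall_stop() reports an entry, and would simply resume the child on every other stop.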
Let's compile and run it:
    [root@localhost ~]# ll /etc/hosts
    -rw-r--r--. 1 root root 158 Apr 27 03:05 /etc/hosts
    [root@localhost ~]# ./pt
    Process executed systemcallID: 59
    Process executed systemcallID: 12
    Process executed systemcallID: 12
    Process executed systemcallID: 9
    Process executed systemcallID: 9
    Process executed systemcallID: 21
    Process executed systemcallID: 21
    Process executed systemcallID: 2
    Process executed systemcallID: 2
    Process executed systemcallID: 5
    Process executed systemcallID: 5
    Process executed systemcallID: 9
    Process executed systemcallID: 9
    Process executed systemcallID: 3
    Process executed systemcallID: 3
    Process executed systemcallID: 2
    Process executed systemcallID: 2
    Process executed systemcallID: 0
    Process executed systemcallID: 0
    Process executed systemcallID: 5
    Process executed systemcallID: 5
    Process executed systemcallID: 9
    Process executed systemcallID: 9
    Process executed systemcallID: 10
    Process executed systemcallID: 10
    Process executed systemcallID: 9
    Process executed systemcallID: 9
    Process executed systemcallID: 9
    Process executed systemcallID: 9
    Process executed systemcallID: 3
    Process executed systemcallID: 3
    Process executed systemcallID: 9
    Process executed systemcallID: 9
    Process executed systemcallID: 9
    Process executed systemcallID: 9
    Process executed systemcallID: 158
    Process executed systemcallID: 158
    Process executed systemcallID: 10
    Process executed systemcallID: 10
    Process executed systemcallID: 10
    Process executed systemcallID: 10
    Process executed systemcallID: 10
    Process executed systemcallID: 10
    Process executed systemcallID: 11
    Process executed systemcallID: 11
    Process executed systemcallID: 87
    error: call syscall unlink() is not allowed!
    [root@localhost ~]# ll /etc/hosts
    -rw-r--r--. 1 root root 158 Apr 27 03:05 /etc/hosts
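We can see that:
(1) Deleting the file involves many system calls. Most numbers appear twice because PTRACE_SYSCALL stops the child at both the entry to and the exit from each call;
In fact, strace is built on the same mechanism. Let's trace /root/a.out with strace and see what it reports: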
    [root@localhost ~]# strace /root/a.out
    execve("/root/a.out", ["/root/a.out"], [/* 27 vars */]) = 0
    brk(0) = 0x19c2000
    mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f8f962c7000
    access("/etc/ld.so.preload", R_OK) = -1 ENOENT (No such file or directory)
    open("/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
    fstat(3, {st_mode=S_IFREG|0644, st_size=29929, ...}) = 0
    mmap(NULL, 29929, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7f8f962bf000
    close(3) = 0
    open("/lib64/libc.so.6", O_RDONLY|O_CLOEXEC) = 3
    read(3, "\177ELF\2\1\1\3\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0 \34\2\0\0\0\0\0"..., 832) = 832
    fstat(3, {st_mode=S_IFREG|0755, st_size=2107816, ...}) = 0
    mmap(NULL, 3932736, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7f8f95ce6000
    mprotect(0x7f8f95e9c000, 2097152, PROT_NONE) = 0
    mmap(0x7f8f9609c000, 24576, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x1b6000) = 0x7f8f9609c000
    mmap(0x7f8f960a2000, 16960, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7f8f960a2000
    close(3) = 0
    mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f8f962be000
    mmap(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f8f962bc000
    arch_prctl(ARCH_SET_FS, 0x7f8f962bc740) = 0
    mprotect(0x7f8f9609c000, 16384, PROT_READ) = 0
    mprotect(0x600000, 4096, PROT_READ) = 0
    mprotect(0x7f8f962c8000, 4096, PROT_READ) = 0
    munmap(0x7f8f962bf000, 29929) = 0
    unlink("/etc/hosts") = -1 ENOENT (No such file or directory)
    exit_group(0) = ?
    +++ exited with 0 +++
Checked against the system call table, the sequence of calls reported by strace is exactly the one our ptrace()-based tracer printed.
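To make that comparison less tedious, the numbers in the tracer's output can be mapped to names. The sketch below covers only the calls that appear in the trace above and assumes the x86-64 system call table as exposed by the SYS_* constants in <sys/syscall.h>; syscall_name() is a hypothetical helper, not something the tracer above uses.

```c
#include <sys/syscall.h>

/* Map an x86-64 system call number to its name.
 * Only the calls seen in this article's trace are listed;
 * everything else is reported as "unknown". */
const char *syscall_name(long nr)
{
    switch (nr) {
    case SYS_read:       return "read";        /* 0 */
    case SYS_open:       return "open";        /* 2 */
    case SYS_close:      return "close";       /* 3 */
    case SYS_fstat:      return "fstat";       /* 5 */
    case SYS_mmap:       return "mmap";        /* 9 */
    case SYS_mprotect:   return "mprotect";    /* 10 */
    case SYS_munmap:     return "munmap";      /* 11 */
    case SYS_brk:        return "brk";         /* 12 */
    case SYS_access:     return "access";      /* 21 */
    case SYS_execve:     return "execve";      /* 59 */
    case SYS_unlink:     return "unlink";      /* 87 */
    case SYS_arch_prctl: return "arch_prctl";  /* 158 */
    default:             return "unknown";
    }
}
```

Printing syscall_name(nr) next to each number turns the tracer's output into the same execve, brk, mmap, access, open sequence that strace shows.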
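(2) The hosts file is still there; the program can no longer delete it by calling unlink();
The same idea generalizes: we can maintain a whitelist of system calls. Any call on the whitelist made by user-submitted code is let through; anything else is reported as a Runtime Error.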
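The whitelist check itself can be sketched in a few lines. The entries below are an assumption chosen for illustration (roughly what a small statically linked program needs on x86-64 Linux); a real judge would need its own audited list, and syscall_allowed() is a hypothetical helper name.

```c
#include <stddef.h>
#include <sys/syscall.h>

/* Example whitelist. Assumption: these entries are illustrative only. */
static const long allowed_syscalls[] = {
    SYS_read, SYS_write, SYS_fstat, SYS_close,
    SYS_brk, SYS_mmap, SYS_munmap, SYS_mprotect,
    SYS_exit_group,
};

/* Returns 1 if the system call number is on the whitelist, 0 otherwise. */
int syscall_allowed(long nr)
{
    size_t count = sizeof(allowed_syscalls) / sizeof(allowed_syscalls[0]);

    for (size_t i = 0; i < count; i++) {
        if (allowed_syscalls[i] == nr)
            return 1;
    }
    return 0;
}
```

In the tracer, the test for unlink would become a test for !syscall_allowed(number): the child is killed and the submission is judged as a Runtime Error. A linear scan is enough for a list this short; a larger list could use a bitmap indexed by system call number.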
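That is one way to build a judger on top of ptrace(). It achieves our goal, but the drawbacks of ptrace() are just as obvious:
- An awkward programming model and intrusive inspection;
- For every system call, ptrace() forces two extra switches between kernel space and user space, which wastes a lot of resources.
To address these two drawbacks, we will go on to look at an implementation based on seccomp (secure computing mode).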
References:
http://man7.org/linux/man-pages/man2/ptrace.2.html
http://www.linuxjournal.com/article/6100
http://www.linuxjournal.com/article/6210