Online Judge Sandbox Design Notes (1)

This article is deprecated, please see here for more details.

There are two main approaches to implementing a sandbox for an OJ (online judge): ptrace and seccomp. We discuss them in turn below.

ptrace()

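ptrace is a facility on Unix-like systems that lets one process observe and control the internal state of another. It has many uses, such as debugging (gdb, dbx, and so on), code coverage analysis, and even runtime patching (IDA and OllyDbg may come to mind). Here we discuss how to use it to restrict system calls so that a program effectively runs in a sandbox.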

First, let's write a program that deletes the /etc/hosts file:

[root@localhost ~]# cat a.c 
#include <unistd.h>
int main() {
    unlink("/etc/hosts");
    return 0;
}
[root@localhost ~]# gcc a.c 
[root@localhost ~]# ll -al /etc/hosts
-rw-r--r--. 1 root root 158 Jun 7 2013 /etc/hosts
[root@localhost ~]# ./a.out 
[root@localhost ~]# ll -al /etc/hosts
ls: cannot access /etc/hosts: No such file or directory

a.out did its job and successfully deleted the hosts file.

Now we need to design a program that uses ptrace() to stop unlink() from deleting files in the file system.

First, let's see how the manual describes ptrace():

#include <sys/ptrace.h>
long ptrace(enum __ptrace_request request, pid_t pid, void *addr, void *data);

The request argument determines what ptrace() does. Here are the requests we will need:

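  • PTRACE_TRACEME: indicates that the calling process (the tracee) is to be traced by its parent (the tracer). Note: this is the only request issued by the tracee itself;
  • PTRACE_PEEKUSER: reads a word from the tracee's USER area, which holds its general-purpose registers; the saved register orig_rax / orig_eax (depending on the platform) contains the number of the most recent syscall;
  • PTRACE_GETREGS: copies all of the tracee's general-purpose registers into a struct user_regs_struct in one call; the program below uses it to read the same register;
  • PTRACE_SYSCALL: resumes a stopped tracee and arranges for it to stop again at the next entry to or exit from a system call;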
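The PTRACE_PEEKUSER request described in the list above can be sketched as follows. This is a minimal sketch, assuming Linux on x86-64 or i386 with glibc's struct user layout; the helper names syscall_nr_offset and peek_syscall_nr are hypothetical and do not come from the article.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/user.h>

/* Byte offset of the saved syscall number inside the tracee's USER area.
 * Returns -1 on architectures this sketch does not cover. */
long syscall_nr_offset(void) {
#if defined(__x86_64__)
    return (long)offsetof(struct user, regs.orig_rax);
#elif defined(__i386__)
    return (long)offsetof(struct user, regs.orig_eax);
#else
    return -1;
#endif
}

/* Hypothetical helper: read the syscall number of a stopped tracee.
 * Returns -1 if the architecture is unsupported or ptrace() fails,
 * for example because pid is not a stopped tracee of the caller. */
long peek_syscall_nr(pid_t pid) {
    long off = syscall_nr_offset();
    if (off < 0)
        return -1;
    errno = 0; /* PEEKUSER returns the data itself, so errors show up in errno */
    long nr = ptrace(PTRACE_PEEKUSER, pid, (void *)off, NULL);
    if (nr == -1 && errno != 0)
        return -1;
    return nr;
}
```

A tracer would call peek_syscall_nr(child) after waitpid() reports a stop. PTRACE_PEEKUSER transfers one word per call, whereas PTRACE_GETREGS fetches the whole register set, which is convenient when syscall arguments are needed as well.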

That gives us a plan: every time the program makes a system call, we check whether it is unlink(); if it is, we kill the child process and exit:

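#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <sys/user.h>
#include <sys/syscall.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/* The syscall number is saved in orig_rax (x86-64) or orig_eax (i386). */
#if __WORDSIZE == 64
#define REG(reg) reg.orig_rax
#else
#define REG(reg) reg.orig_eax
#endif

int main(int argc, char* argv[]) {
    pid_t child;
    child = fork();
    if (child < 0) {
        perror("fork");
        exit(1);
    }
    if (child == 0) {
        /* Tracee: ask to be traced by the parent, then run the target. */
        ptrace(PTRACE_TRACEME, 0, NULL, NULL);
        execl("/root/a.out", "a.out", NULL);
        perror("execl"); /* reached only if execl() fails */
        _exit(1);
    } else {
        int status;
        /* Loop while the child keeps stopping; leave once it exits or is killed. */
        while (waitpid(child, &status, 0) > 0 && WIFSTOPPED(status)) {
            struct user_regs_struct regs;
            ptrace(PTRACE_GETREGS, child, NULL, &regs);
            if (REG(regs) == SYS_unlink) {
                fprintf(stderr, "error: call syscall unlink() is not allowed!\n");
                kill(child, SIGKILL);
                waitpid(child, &status, 0); /* reap the killed child */
                return 0;
            }
            fprintf(stdout, "Process executed systemcallID: %ld\n", (long)REG(regs));
            /* Resume the child until its next syscall entry or exit. */
            ptrace(PTRACE_SYSCALL, child, NULL, NULL);
        }
    }
    exit(0);
}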

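When a system call occurs, the kernel saves the current contents of %eax/%rax (the system call number) into orig_eax/orig_rax in the tracee's saved register set. We can read this value by calling ptrace() with PTRACE_PEEKUSER or, as the program above does, by fetching all registers with PTRACE_GETREGS, and then resume the child by calling ptrace() with PTRACE_SYSCALL. As soon as the tracer sees the child make system call 87 (sys_unlink on x86-64), it kills the child and exits.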

Let's compile and run it:

[root@localhost ~]# ll /etc/hosts
-rw-r--r--. 1 root root 158 Apr 27 03:05 /etc/hosts
[root@localhost ~]# ./pt
Process executed systemcallID: 59
Process executed systemcallID: 12
Process executed systemcallID: 12
Process executed systemcallID: 9
Process executed systemcallID: 9
Process executed systemcallID: 21
Process executed systemcallID: 21
Process executed systemcallID: 2
Process executed systemcallID: 2
Process executed systemcallID: 5
Process executed systemcallID: 5
Process executed systemcallID: 9
Process executed systemcallID: 9
Process executed systemcallID: 3
Process executed systemcallID: 3
Process executed systemcallID: 2
Process executed systemcallID: 2
Process executed systemcallID: 0
Process executed systemcallID: 0
Process executed systemcallID: 5
Process executed systemcallID: 5
Process executed systemcallID: 9
Process executed systemcallID: 9
Process executed systemcallID: 10
Process executed systemcallID: 10
Process executed systemcallID: 9
Process executed systemcallID: 9
Process executed systemcallID: 9
Process executed systemcallID: 9
Process executed systemcallID: 3
Process executed systemcallID: 3
Process executed systemcallID: 9
Process executed systemcallID: 9
Process executed systemcallID: 9
Process executed systemcallID: 9
Process executed systemcallID: 158
Process executed systemcallID: 158
Process executed systemcallID: 10
Process executed systemcallID: 10
Process executed systemcallID: 10
Process executed systemcallID: 10
Process executed systemcallID: 10
Process executed systemcallID: 10
Process executed systemcallID: 11
Process executed systemcallID: 11
Process executed systemcallID: 87
error: call syscall unlink() is not allowed!
[root@localhost ~]# ll /etc/hosts
-rw-r--r--. 1 root root 158 Apr 27 03:05 /etc/hosts

As we can see:
(1) deleting a file involves a large number of system calls;
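Most numbers in the output above appear twice because PTRACE_SYSCALL stops the tracee once when it enters a system call and once when it leaves. A minimal sketch of how a tracer could tell the two apart by toggling a flag, assuming it is called only for syscall stops that follow the initial stop after exec; struct syscall_tracker and tracker_on_stop are hypothetical names, not part of the article's program.

```c
#include <stdbool.h>

/* Tracks whether the tracee is currently inside a system call. */
struct syscall_tracker {
    bool in_syscall; /* true between an entry stop and its exit stop */
    long completed;  /* number of syscalls whose exit stop was seen */
};

/* Call once per syscall stop.
 * Returns true if this stop is a syscall entry, false if it is an exit. */
bool tracker_on_stop(struct syscall_tracker *t) {
    t->in_syscall = !t->in_syscall;
    if (!t->in_syscall)
        t->completed++;
    return t->in_syscall;
}
```

With such a flag the tracer could check the whitelist and print the number only on entry, so each system call is reported once.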

In fact, strace works on the same principle. Let's use strace to trace /root/a.out and see what we get:

[root@localhost ~]# strace /root/a.out
execve("/root/a.out", ["/root/a.out"], [/* 27 vars */]) = 0
brk(0)                                  = 0x19c2000
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f8f962c7000
access("/etc/ld.so.preload", R_OK)      = -1 ENOENT (No such file or directory)
open("/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=29929, ...}) = 0
mmap(NULL, 29929, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7f8f962bf000
close(3)                                = 0
open("/lib64/libc.so.6", O_RDONLY|O_CLOEXEC) = 3
read(3, "\177ELF\2\1\1\3\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0 \34\2\0\0\0\0\0"..., 832) = 832
fstat(3, {st_mode=S_IFREG|0755, st_size=2107816, ...}) = 0
mmap(NULL, 3932736, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7f8f95ce6000
mprotect(0x7f8f95e9c000, 2097152, PROT_NONE) = 0
mmap(0x7f8f9609c000, 24576, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x1b6000) = 0x7f8f9609c000
mmap(0x7f8f960a2000, 16960, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7f8f960a2000
close(3)                                = 0
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f8f962be000
mmap(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f8f962bc000
arch_prctl(ARCH_SET_FS, 0x7f8f962bc740) = 0
mprotect(0x7f8f9609c000, 16384, PROT_READ) = 0
mprotect(0x600000, 4096, PROT_READ)     = 0
mprotect(0x7f8f962c8000, 4096, PROT_READ) = 0
munmap(0x7f8f962bf000, 29929)           = 0
unlink("/etc/hosts")                    = -1 ENOENT (No such file or directory)
exit_group(0)                           = ?
+++ exited with 0 +++

Checked against the system call table, the sequence reported by strace is identical to the one printed by our ptrace()-based tracer.
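The comparison above depends on mapping system call numbers to names. A minimal sketch of such a lookup, assuming Linux and covering only the calls that appear in this trace; the SYS_* constants come from <sys/syscall.h>, while syscall_name and syscall_number are hypothetical helper names.

```c
#include <stddef.h>
#include <string.h>
#include <sys/syscall.h>

struct syscall_entry {
    long nr;
    const char *name;
};

/* Partial table: only calls seen in the trace above, not a full list.
 * Legacy calls that some architectures lack are guarded with #ifdef. */
static const struct syscall_entry SYSCALL_TABLE[] = {
    { SYS_read, "read" },         { SYS_close, "close" },
    { SYS_fstat, "fstat" },       { SYS_mmap, "mmap" },
    { SYS_mprotect, "mprotect" }, { SYS_munmap, "munmap" },
    { SYS_brk, "brk" },           { SYS_execve, "execve" },
#ifdef SYS_open
    { SYS_open, "open" },
#endif
#ifdef SYS_access
    { SYS_access, "access" },
#endif
#ifdef SYS_unlink
    { SYS_unlink, "unlink" },
#endif
#ifdef SYS_arch_prctl
    { SYS_arch_prctl, "arch_prctl" },
#endif
};

static const size_t SYSCALL_TABLE_LEN =
    sizeof(SYSCALL_TABLE) / sizeof(SYSCALL_TABLE[0]);

/* Number to name; returns "unknown" for calls missing from the table. */
const char *syscall_name(long nr) {
    for (size_t i = 0; i < SYSCALL_TABLE_LEN; i++)
        if (SYSCALL_TABLE[i].nr == nr)
            return SYSCALL_TABLE[i].name;
    return "unknown";
}

/* Name to number; returns -1 for names missing from the table. */
long syscall_number(const char *name) {
    for (size_t i = 0; i < SYSCALL_TABLE_LEN; i++)
        if (strcmp(SYSCALL_TABLE[i].name, name) == 0)
            return SYSCALL_TABLE[i].nr;
    return -1;
}
```

The tracer could print syscall_name(nr) next to each number, which makes its output directly comparable with the strace listing.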

(2) the hosts file is still there: the program can no longer call unlink() to delete it;

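Along the same lines, we can maintain a whitelist of system calls: any call on the whitelist made by user-submitted code is let through, and anything else is immediately reported as a Runtime Error.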
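The whitelist idea can be sketched as a simple membership check. This is only an illustration under stated assumptions: the list of allowed calls below is hypothetical and far from a vetted policy, and the names syscall_allowed, judge_syscall, and enum verdict are invented for this example.

```c
#include <stdbool.h>
#include <stddef.h>
#include <sys/syscall.h>

/* Hypothetical whitelist, chosen for illustration only. */
static const long ALLOWED_SYSCALLS[] = {
    SYS_read,  SYS_write,    SYS_close,  SYS_brk,
    SYS_mmap,  SYS_mprotect, SYS_munmap, SYS_execve,
    SYS_exit_group,
};

enum verdict {
    VERDICT_CONTINUE,      /* let the syscall through */
    VERDICT_RUNTIME_ERROR  /* kill the submission and report an error */
};

/* Returns true if nr is on the whitelist. */
bool syscall_allowed(long nr) {
    size_t n = sizeof(ALLOWED_SYSCALLS) / sizeof(ALLOWED_SYSCALLS[0]);
    for (size_t i = 0; i < n; i++)
        if (ALLOWED_SYSCALLS[i] == nr)
            return true;
    return false;
}

/* Decide what the tracer should do with a syscall number it observed. */
enum verdict judge_syscall(long nr) {
    return syscall_allowed(nr) ? VERDICT_CONTINUE : VERDICT_RUNTIME_ERROR;
}
```

In the tracer loop, the comparison against unlink would be replaced by a call to judge_syscall() on the observed number. A linear scan is adequate for a short list; a lookup table indexed by syscall number would be a natural alternative for larger ones.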

That is one way to implement a judger with ptrace(). It achieves our goal, but the drawbacks of ptrace() are just as obvious:

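  1. An awkward programming model and intrusive inspection;
  2. For every system call, ptrace() stops the tracee twice (on entry and on exit), and each stop means switching between kernel space and user space, which wastes resources.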

To address these two drawbacks, we will go on to look at an implementation based on seccomp (secure computing mode).

References:

http://man7.org/linux/man-pages/man2/ptrace.2.html

http://www.linuxjournal.com/article/6100

http://www.linuxjournal.com/article/6210

https://gist.github.com/willb/14488
