Online Judge from Scratch(3) – Sandbox

This article consists of two parts: the sandbox for GCC and the sandbox for the compiled binary.

The sandbox for GCC

Essentially, our sandbox for GCC is a wrapper around GCC with a watchdog, just like the sandbox we designed for the Java compiler. However, GCC has more situations that need careful consideration.

In our Java sandbox, if the Java compiler runs longer than expected, the sandbox sends SIGKILL to the compiler's process ID and the compiler terminates immediately. Unlike the Java compiler, GCC spawns several child processes to complete the compilation. If we send SIGKILL or SIGTERM only to the process ID of GCC, any child processes that are still running become orphans and escape our sandbox. Things could get even more dangerous if the submission happens to be a compiler bomb.

On Linux, we can terminate a process together with all its children by sending SIGKILL or SIGTERM to the negative process group ID:
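A minimal sketch of the idea in Go, not Justice's actual code: the helper names startInOwnGroup and killGroup are made up for illustration, and a shell with a background sleep stands in for GCC and its children. The command is started as the leader of a new process group, so its PID doubles as the group ID.

```go
package main

import (
	"fmt"
	"os/exec"
	"syscall"
)

// startInOwnGroup starts the command as the leader of a new process group,
// so the process group ID (PGID) equals the command's PID.
func startInOwnGroup(name string, args ...string) (*exec.Cmd, error) {
	cmd := exec.Command(name, args...)
	cmd.SysProcAttr = &syscall.SysProcAttr{Setpgid: true}
	if err := cmd.Start(); err != nil {
		return nil, err
	}
	return cmd, nil
}

// killGroup sends SIGKILL to every process in the group led by pgid.
// A negative PID passed to kill(2) addresses the whole process group.
func killGroup(pgid int) error {
	return syscall.Kill(-pgid, syscall.SIGKILL)
}

func main() {
	// The shell and its background sleep stand in for GCC and its children.
	cmd, err := startInOwnGroup("sh", "-c", "sleep 30 & sleep 30")
	if err != nil {
		fmt.Println("start failed:", err)
		return
	}
	if err := killGroup(cmd.Process.Pid); err != nil {
		fmt.Println("kill failed:", err)
	}
	fmt.Println("wait returned:", cmd.Wait())
}
```

Because the children inherit the process group of the command, one signal reaches all of them, including those that the parent has not yet waited for.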

 

Go provides a convenient way of setting a timer: time.AfterFunc. AfterFunc waits for the duration to elapse and then calls the callback function in its own goroutine. It returns a Timer that can be used to cancel the call using its Stop method. So we can set our GCC sandbox’s timer like this:

Even terminating all processes may not be enough to handle a compiler bomb safely. GCC generates temporary files (in /tmp by default) while compiling, and after receiving SIGKILL, GCC and its child processes die immediately without removing them. For a compiler bomb like int main[0x10000000]={1};, the temporary files can get extremely large, and the server’s disk may soon fill up. The workaround is to keep GCC’s temporary files in the directory where Justice saves the submission code (by passing -save-temps to GCC and compiling in that directory), so that Justice cleans them up together with everything else at the end.

The information leakage we discussed before also applies to our sandbox for GCC.

And last but not least: GCC sometimes prints a huge number of error messages, so we add a parameter like -fmax-errors=10 to keep the log small when we have to record the errors, without spending too much I/O.

Now we no longer need to worry about security risks during compilation. Next, let’s take a look at how to implement the sandbox for the compiled binary, which is the really interesting part.

The sandbox for the compiled binary

The goal of a sandbox is to reduce the potential attack surface by restricting what a process can do, for example: running an infinite loop, launching a fork() bomb, sleeping forever, opening network connections, or allocating huge amounts of memory. How to implement a sandbox has been discussed in my previous blogs, so here we only review the technologies briefly:

ptrace()

The operating system offers a standard mechanism called system calls to access the underlying hardware and low-level services such as file systems. Each system call has a unique number, which the kernel uses to identify the call being invoked. The user-level function copies the system call number into a specific CPU register (on the i386 architecture, the register %eax), then executes a trap instruction (int 0x80; modern machines use the faster SYSENTER or SYSCALL instructions instead). This instruction causes the processor to switch from user mode to kernel mode, and the kernel executes the system call after examining the arguments.

However, if ptrace() has told the kernel that the child process is being traced, the kernel stops the child process before it executes the system call and hands control over to the parent process. With this ability, if the sandbox (the parent process) finds that the compiled binary (the child process) is about to invoke a “dangerous” system call on its black list, it can terminate the child process immediately. For example, suppose a submission tries to delete /etc/hosts with unlink() (back up the file before trying this yourself).

Here is a sandbox implemented with ptrace() that can prevent child process from invoking the unlink() system call:

Running the sandbox outputs:

root@justice:~# ./ptrace
Child process invoked a system call, ID: [59]
Child process invoked a system call, ID: [12]
Child process invoked a system call, ID: [12]
Child process invoked a system call, ID: [21]
....
Child process invoked a system call, ID: [10]
Child process invoked a system call, ID: [10]
Child process invoked a system call, ID: [11]
Child process invoked a system call, ID: [11]
Invoking system call unlink() is not allowed!

seccomp

Linux seccomp (short for SECure COMputing) is another efficient way of filtering that allows one to specify which system calls a process should be allowed to invoke, reducing the kernel surface exposed to applications. This provides a clearly defined mechanism to build sandbox environments, where processes can run having access only to a specific reduced set of system calls. Seccomp now supports two modes:

In mode SECCOMP_SET_MODE_STRICT, seccomp only allows the read(), write(), _exit() and sigreturn() system calls, so the process can only read and write files that are already open, and exit. If the process attempts to invoke any other system call, the kernel terminates it with a SIGKILL signal.

In mode SECCOMP_SET_MODE_FILTER, the system calls allowed can be defined by a pointer to a Berkeley Packet Filter (BPF). Additionally, instead of simply terminating the process, the filter can raise a signal, which allows the signal handler to simulate the effect of a disallowed system call (or simply gather more information on the failure for debugging purposes). Now we sandbox the unlink() system call by using seccomp:

Besides filtering “bad” system calls, several other measures help us strengthen our sandbox:

  • setuid() and setgid() change the user and group IDs of the calling process. If the sandbox calls them before executing the compiled binary, the untrusted code runs with the privileges of an ordinary user instead of root.
  • setrlimit() can set a limit on both CPU run time and memory usage.

Next, we’ll discuss how Justice implements its sandbox.

Namespaces

Linux namespaces isolate global system resources between independent processes without low-level virtualization technology like KVM. Changes to a global resource are visible to the processes that are members of the same namespace, but invisible to all other processes. Linux now provides the following namespaces:

Justice isolates its sandbox with all of the namespaces above, with two subtle differences:

  • Justice uses CLONE_NEWPID but doesn’t mount the /proc directory (you can imagine the sandbox as a big lonely vacuum: nothing exists there except the lonely compiled binary lying in the lonely root directory /, alone), since it does not need to collect information about the processes running in the sandbox;
  • Justice uses CLONE_NEWNET without providing any network tools, since the sandbox doesn’t need any network connection at all.

This is a different view of sandboxing: we don’t focus on whether individual system calls are safe or not; instead, we jail the compiled binary in separate namespaces so that its system calls cannot interfere with the host.

By default, CLONE_NEWUSER is disabled on CentOS 7 / RHEL 7, and errors may still occur even after enabling the feature with the following command:

 grubby --args="user_namespace.enable=1" --update-kernel="$(grubby --default-kernel)" && reboot now

and that’s why we recommend deploying Justice in Ubuntu 16.04 LTS rather than the others.

CGroups

However, namespaces do not restrict access to physical resources such as CPU, memory and disk. That access is metered and restricted by a kernel feature called CGroups(short for Control Groups). Linux provides the following CGroups subsystems:

  • blkio — this subsystem sets limits on input/output access to and from block devices such as physical drives (disk, solid state, USB, etc.).
  • cpu — this subsystem sets limits on the available CPU time.
  • cpuacct — this subsystem generates automatic reports on CPU resources used by tasks in a CGroup.
  • cpuset — this subsystem assigns individual CPUs (on a multicore system) and memory nodes to tasks in a CGroup.
  • devices — this subsystem allows or denies access to devices by tasks in a CGroup.
  • freezer — this subsystem suspends or resumes tasks in a CGroup.
  • memory — this subsystem sets limits on memory use by tasks in a CGroup and generates automatic reports on memory resources used by those tasks.
  • net_cls — this subsystem tags network packets with a class identifier (classid) that allows the Linux traffic controller (tc) to identify packets originating from a particular CGroup task.
  • net_prio — this subsystem provides a way to dynamically set the priority of network traffic per network interface.
  • ns — this is the namespace subsystem.

Justice uses three subsystems: cpu, memory (both user memory and kernel memory) and pids, which limits the number of tasks in a CGroup and is missing from the list above. Cleaning up empty CGroups by hand is unnecessary, since Linux provides a way to remove them automatically.

Linux namespaces and CGroups are also the underlying technologies that Docker containers rely on.

Conclusion

We briefly introduced three ways of sandboxing untrusted C/C++ code:

  1. ptrace()
  2. seccomp with setuid(), setgid() and setrlimit()
  3. Linux namespaces and CGroups

For now, (3) is the most convenient way to build a sandbox on Linux. Go and Rust can also be easily integrated into Justice in the same way.
