OS EX Virtualization

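Processes on an Operating System

What exactly does the operating system do after it boots?

CPU Reset → Firmware → Boot loader → Kernel _start()

The operating system loads the first program. From then on, the Linux kernel retreats into the background, and the rest of the world is created through syscalls.

Use pstree to observe the systemd process.

Building a Minimal Linux

There is no storage device, only an initramfs containing a few files: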

├── initramfs
│  ├── bin
│  │  └── busybox
│  ├── code
│  │  ├── a.c
│  │  └── hello
│  └── init
├── Makefile
└── vmlinuz

Here init is the first program the operating system loads:

#!/bin/busybox sh

c1="arch ash base64 cat chattr chgrp chmod chown conspy cp cpio cttyhack date dd df dmesg dnsdomainname dumpkmap echo ed egrep false fatattr fdflush fgrep fsync getopt grep gunzip gzip hostname hush ionice iostat ipcalc kbd_mode kill link linux32 linux64 ln login ls lsattr lzop makemime mkdir mknod mktemp more mount mountpoint mpstat mt mv netstat nice nuke pidof ping ping6 pipe_progress printenv ps pwd reformime resume rev rm rmdir rpm run-parts scriptreplay sed setarch setpriv setserial sh sleep stat stty su sync tar touch true umount uname usleep vi watch zcat"
c2="[ [[ awk basename bc beep blkdiscard bunzip2 bzcat bzip2 cal chpst chrt chvt cksum clear cmp comm crontab cryptpw cut dc deallocvt diff dirname dos2unix dpkg dpkg-deb du dumpleases eject env envdir envuidgid expand expr factor fallocate fgconsole find flock fold free ftpget ftpput fuser groups hd head hexdump hexedit hostid id install ipcrm ipcs killall last less logger logname lpq lpr lsof lspci lsscsi lsusb lzcat lzma man md5sum mesg microcom mkfifo mkpasswd nc nl nmeter nohup nproc nsenter nslookup od openvt passwd paste patch pgrep pkill pmap printf pscan"
c3="pstree pwdx readlink realpath renice reset resize rpm2cpio runsv runsvdir rx script seq setfattr setkeycodes setsid setuidgid sha1sum sha256sum sha3sum sha512sum showkey shred shuf smemcap softlimit sort split ssl_client strings sum sv svc svok tac tail taskset tcpsvd tee telnet test tftp time timeout top tr traceroute traceroute6 truncate ts tty ttysize udhcpc6 udpsvd unexpand uniq unix2dos unlink unlzma unshare unxz unzip uptime users uudecode uuencode vlock volname w wall wc wget which who whoami whois xargs xxd xz xzcat yes"
for cmd in $c1 $c2 $c3; do
  /bin/busybox ln -s /bin/busybox /bin/$cmd
done
mkdir -p /proc && mount -t proc  none /proc
mkdir -p /sys  && mount -t sysfs none /sys
export PS1='(linux) '

# Rock'n Roll!
/bin/busybox sh

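The main tool is busybox.

First, symbolic links are created for a set of commands,

so that you can type ls directly instead of /bin/busybox ls.

Then procfs and sysfs are mounted, which exposes some system information to applications.

For example, pstree or top can then show process information.

Next, PS1 (the shell prompt) is changed.

Finally a shell is started. Note that this shell never returns.

You can also add statically linked binaries, such as hello, directly to the file system,

and edit code with vi.

There is just no gcc.

Together with the vmlinuz kernel image, this can be booted in QEMU: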

make && make run

This sometimes fails, and I have not figured out how to exit.

OS API Overview

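Process (state machine) management
- fork, execve, exit - create/change/delete state machines
Memory (address space) management
- mmap - virtual address space management
File (data object) management
- open, close, read, write - file access
- mkdir, link, unlink - directory management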

fork()

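Virtualization → the operating system keeps multiple state machines in physical memory

To do that, we need an API that creates state machines

pid_t fork(void);

Immediately copies the state machine: memory plus registers
In the newly created process, fork returns 0
In the process that called fork, it returns the child's process ID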
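The two return values can be checked with a small sketch. The helper name `fork_child_exit_status` is made up for illustration; it assumes a POSIX system and uses only `fork`, `_exit`, and `waitpid`.

```c
#include <assert.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: fork once, let the child exit with `code`,
 * and return the exit status the parent observes (-1 on error). */
int fork_child_exit_status(int code) {
  pid_t pid = fork();
  if (pid < 0) return -1;      /* fork failed: no child was created */
  if (pid == 0) _exit(code);   /* child: fork() returned 0 */
  assert(pid > 0);             /* parent: fork() returned the child's pid */
  int status;
  if (waitpid(pid, &status, 0) != pid) return -1;
  return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Calling `fork_child_exit_status(42)` from `main` should yield 42: the branch taken by the child is selected purely by the 0 return value.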

Fork Bomb

Don’t try it (or try it in docker)

https://www.geeksforgeeks.org/fork-bomb/

:(){ :|: & };:

ex1

pid_t pid1 = fork();
pid_t pid2 = fork();
pid_t pid3 = fork();
printf("Hello World from (%d, %d, %d)\n", pid1, pid2, pid3);

Draw the fork diagram:

flowchart TD
  A["(x, x, x)"] --- B["(p1, x, x)"]
  A --- C["(0, x, x)"]
  B --- D["(p1, p2, x)"]
  B --- E["(p1, 0, x)"]
  C --- F["(0, p3, x)"]
  C --- G["(0, 0, x)"]
  D --- H["(p1, p2, p4)"]
  D --- I["(p1, p2, 0)"]
  E --- J["(p1, 0, p5)"]
  E --- K["(p1, 0, 0)"]
  F --- L["(0, p3, p6)"]
  F --- M["(0, p3, 0)"]
  G --- N["(0, 0, p7)"]
  G --- O["(0, 0, 0)"]

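In theory there are 7 distinct child pids and 8 lines of output in total.

One possible output is shown below. The order is nondeterministic because the processes run concurrently.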

Hello World from (35220, 35221, 35222)
Hello World from (35220, 35221, 0)
Hello World from (35220, 0, 35224)
Hello World from (0, 35223, 35225)
Hello World from (35220, 0, 0)
Hello World from (0, 0, 35226)
Hello World from (0, 0, 0)
Hello World from (0, 35223, 0)
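The count of 8 can be confirmed without reading interleaved output. In the sketch below (the function name and the pipe-based counting are assumptions made for illustration), every process that survives n consecutive fork() calls writes one byte into a pipe, and the caller counts the bytes.

```c
#include <assert.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: run n consecutive fork() calls in a child process tree
 * and return how many processes reach the end (expected: 2^n). */
int count_processes_after_forks(int n) {
  assert(n >= 0 && n <= 10);           /* keep the fan-out small */
  int p[2];
  if (pipe(p) < 0) return -1;
  pid_t root = fork();
  if (root < 0) { close(p[0]); close(p[1]); return -1; }
  if (root == 0) {
    close(p[0]);
    for (int i = 0; i < n; i++)
      if (fork() < 0) _exit(1);        /* every existing process forks again */
    if (write(p[1], "x", 1) != 1) _exit(1);  /* one byte per process */
    while (wait(NULL) > 0) {}          /* reap this process's own children */
    _exit(0);
  }
  close(p[1]);
  int count = 0;
  char c;
  /* read() reports EOF only after every descendant has closed the write end */
  while (read(p[0], &c, 1) == 1) count++;
  close(p[0]);
  waitpid(root, NULL, 0);
  return count;
}
```

With n = 3 this mirrors ex1: the eight leaves of the fork diagram each contribute one byte.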

ex2

for (int i = 0; i < 2; i++) {
  fork();
  printf("Hello\n");
}

This is equivalent to

fork();
printf("Hello\n");
fork();
printf("Hello\n");

In theory there are 6 lines of output:

                   ┌─────────
                   │  print
                   │
            print  │  print
         ┌─────────┴─────────
         │
         │
         │            print
         │         ┌─────────
         │         │
         │  print  │  print
─────────┴─────────┴─────────

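Note, however, that a child process inherits its parent's stdio buffer.

So the analysis above assumes line buffering, where \n flushes the buffer.

With a pipe, for example

./a.out | cat

there are 8 lines of output: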

                   ┌────────────
                   │  print × 2
                   │
                   │  print × 2
         ┌─────────┴────────────
         │
         │
         │            print × 2
         │         ┌────────────
         │         │
         │         │  print × 2
─────────┴─────────┴────────────

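This is because stdout is fully buffered when it is connected to a pipe: write is only called when the buffer fills up or the process exits.

Alternatively, drop the \n. Then

./a.out
./a.out | cat

both print Hello 8 times.

Or call fflush after each printf to force a flush. Then

./a.out
./a.out | cat

both print 6 lines.

Or call setbuf(stdout, NULL) to make stdout unbuffered. Then

./a.out
./a.out | cat

both print 6 lines.

In summary:

If the buffer is flushed promptly, or there is no buffer, there are 6 lines
If write is deferred until the buffer fills up (or the process exits), there are 8 lines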
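The 6-versus-8 difference can be reproduced in one program. This is a sketch under stated assumptions: the function name is hypothetical, and instead of stdout it uses a private stream created with `fdopen` on a pipe, which stdio fully buffers because a pipe is not an interactive device.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: run the ex2 loop in a child whose output stream is a
 * fully buffered pipe, and return the number of "Hello" lines received. */
int count_hello_lines(int flush_each_time) {
  int p[2];
  if (pipe(p) < 0) return -1;
  pid_t root = fork();
  if (root < 0) { close(p[0]); close(p[1]); return -1; }
  if (root == 0) {
    close(p[0]);
    FILE *out = fdopen(p[1], "w");     /* pipe-backed stream: fully buffered */
    if (out == NULL) _exit(1);
    for (int i = 0; i < 2; i++) {
      fork();                          /* the child copies the pending buffer too */
      fprintf(out, "Hello\n");
      if (flush_each_time) fflush(out);
    }
    fflush(out);                       /* what exit() would do for this stream */
    while (wait(NULL) > 0) {}
    _exit(0);
  }
  close(p[1]);
  int lines = 0;
  char c;
  while (read(p[0], &c, 1) == 1)
    if (c == '\n') lines++;
  close(p[0]);
  waitpid(root, NULL, 0);
  return lines;
}
```

Without the per-iteration flush, each of the four final processes holds two buffered lines, giving 8. With it, nothing is pending at fork time, giving 2 + 4 = 6.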

ex3

What should happen when one thread of a multithreaded program calls fork()?

execve()

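Resets the currently running state machine to the initial state of another program

int execve(const char *filename, char *const argv[], char *const envp[]);

Executes the program named filename
Lets you set the arguments argv and the environment variables envp of the new state machine
These correspond exactly to the parameters of main()

An example: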

#include <stdio.h>
#include <unistd.h>

int main(void) {
  char * const argv[] = {
    "/bin/bash", "-c", "env", NULL,
  };
  char * const envp[] = {
    "HELLO=WORLD", NULL,
  };
  execve(argv[0], argv, envp);
  printf("Hello, World!\n");
}

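This is equivalent to typing the following on the command line,

$ bash -c env

except with a custom environment.

It seems bash's configuration is gone as well; there are no colors anymore...

The argument passing can be observed with strace.

execve does not return on success, so Hello, World! is never printed.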
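Both claims (the new program sees only envp, and control never comes back after a successful execve) can be checked with the sketch below. It is an illustration under assumptions: the function name is made up, and it runs `/bin/sh -c 'echo "$HELLO"'` rather than the bash command above so the output is a single predictable line.

```c
#include <assert.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: returns 1 if a program started with execve sees the
 * HELLO variable passed in envp, 0 if not, -1 on error. */
int exec_sees_custom_env(void) {
  int p[2];
  if (pipe(p) < 0) return -1;
  pid_t pid = fork();
  if (pid < 0) { close(p[0]); close(p[1]); return -1; }
  if (pid == 0) {
    char *const argv[] = {"/bin/sh", "-c", "echo \"$HELLO\"", NULL};
    char *const envp[] = {"HELLO=WORLD", NULL};
    dup2(p[1], STDOUT_FILENO);         /* capture the program's stdout */
    close(p[0]);
    close(p[1]);
    execve(argv[0], argv, envp);
    _exit(127);                        /* reached only if execve failed */
  }
  close(p[1]);
  char buf[64] = {0};
  size_t len = 0;
  ssize_t n;
  while (len < sizeof buf - 1 &&
         (n = read(p[0], buf + len, sizeof buf - 1 - len)) > 0)
    len += (size_t)n;
  close(p[0]);
  int status;
  if (waitpid(pid, &status, 0) != pid) return -1;
  return WIFEXITED(status) && WEXITSTATUS(status) == 0 &&
         strcmp(buf, "WORLD\n") == 0;
}
```

An exit status of 127 from the child would mean the line after execve ran, which only happens when execve itself fails.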

This suggests a hack: tamper with PATH.

virtualization$ PATH= /bin/gcc execve-demo.c
gcc: fatal error: cannot execute ‘as’: execvp: No such file or directory
compilation terminated.

Observe how PATH is resolved:

PATH=x:y:z /usr/bin/strace -f /bin/gcc execve-demo.c |& vim -

The assembler can no longer be found:

[pid 71449] execve("x/as", ["as", "--64", "-o", "/tmp/ccvUwIG0.o", "/tmp/ccVFiNO3.s"], 0x531040 /* 74 vars */) = -1 ENOENT (No such file or directory)
[pid 71449] execve("y/as", ["as", "--64", "-o", "/tmp/ccvUwIG0.o", "/tmp/ccVFiNO3.s"], 0x531040 /* 74 vars */) = -1 ENOENT (No such file or directory)
[pid 71449] execve("z/as", ["as", "--64", "-o", "/tmp/ccvUwIG0.o", "/tmp/ccVFiNO3.s"], 0x531040 /* 74 vars */) = -1 ENOENT (No such file or directory)

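_exit()

Destroys the state machine immediately

Several ways to exit:

exit(0) - a libc function declared in stdlib.h
- runs the handlers registered with atexit
_exit(0) - glibc's syscall wrapper
- performs the exit_group system call, terminating the whole process including all of its threads
- does not run atexit handlers
syscall(SYS_exit, 0)
- performs the exit system call, terminating only the current thread
- does not run atexit handlers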
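The atexit difference between exit and _exit can be observed from a parent process. In this sketch the names `notify_parent` and `atexit_handler_runs` are hypothetical; the handler reports that it ran by writing one byte into a pipe.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

static int notify_fd = -1;

/* atexit handler: tell the parent that libc ran the exit handlers */
static void notify_parent(void) {
  ssize_t r = write(notify_fd, "x", 1);
  (void)r;
}

/* Hypothetical helper: returns how many times the handler ran in a child that
 * terminates with exit(0) (use_libc_exit != 0) or with _exit(0). */
int atexit_handler_runs(int use_libc_exit) {
  int p[2];
  if (pipe(p) < 0) return -1;
  fflush(NULL);                        /* avoid re-flushing inherited buffers in the child */
  pid_t pid = fork();
  if (pid < 0) { close(p[0]); close(p[1]); return -1; }
  if (pid == 0) {
    close(p[0]);
    notify_fd = p[1];
    if (atexit(notify_parent) != 0) _exit(2);
    if (use_libc_exit) exit(0);        /* libc: handlers, stdio flush, then exit_group */
    _exit(0);                          /* straight to the kernel */
  }
  close(p[1]);
  int runs = 0;
  char c;
  while (read(p[0], &c, 1) == 1) runs++;
  close(p[0]);
  waitpid(pid, NULL, 0);
  return runs;
}
```

Note that exit(0) in the child also runs any handlers the parent had registered before the fork, since the child inherits them.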

RTFM

man 2 exit

The Address Space of a Process

observation

Cross-check against the information reported by readelf.

How to inspect it:

pmap [pid]
/proc/[pid]/maps

Three examples

minimal.S

Inspect with readelf:

Program Headers:
  Type           Offset             VirtAddr           PhysAddr
                 FileSiz            MemSiz              Flags  Align
  LOAD           0x0000000000000000 0x0000000000400000 0x0000000000400000
                 0x00000000000000b0 0x00000000000000b0  R      0x1000
  LOAD           0x0000000000001000 0x0000000000401000 0x0000000000401000
                 0x000000000000004a 0x000000000000004a  R E    0x1000

Debug with gdb.

After starti, run info inferiors to get the process ID.

minimal$ pmap 106597
106597:   /home/vgalaxy/Desktop/virtual-machine-repository/code/minimal/a.out
0000000000400000      4K r---- a.out
0000000000401000      4K r-x-- a.out
00007ffff7ff9000     16K r----   [ anon ]
00007ffff7ffd000      8K r-x--   [ anon ]
00007ffffffde000    132K rw---   [ stack ]
ffffffffff600000      4K --x--   [ anon ]
 total              168K

/proc/[pid]/maps has more detailed information:

minimal$ cat /proc/106597/maps
00400000-00401000 r--p 00000000 08:05 408309                             /home/vgalaxy/Desktop/virtual-machine-repository/code/minimal/a.out
00401000-00402000 r-xp 00001000 08:05 408309                             /home/vgalaxy/Desktop/virtual-machine-repository/code/minimal/a.out
7ffff7ff9000-7ffff7ffd000 r--p 00000000 00:00 0                          [vvar]
7ffff7ffd000-7ffff7fff000 r-xp 00000000 00:00 0                          [vdso]
7ffffffde000-7ffffffff000 rw-p 00000000 00:00 0                          [stack]
ffffffffff600000-ffffffffff601000 --xp 00000000 00:00 0                  [vsyscall]

The format is documented in man 5 proc:

address perms offset dev inode pathname
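A process can read its own map the same way. The sketch below parses only the first two fields (address range and perms) of /proc/self/maps; the function names are hypothetical, and it assumes a Linux system with /proc mounted.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: find the mapping that contains addr.
 * Returns 1 and fills perms (e.g. "r-xp") if found, 0 if addr is unmapped,
 * -1 if /proc/self/maps cannot be opened. */
int mapping_perms(const void *addr, char perms[5]) {
  FILE *fp = fopen("/proc/self/maps", "r");
  if (fp == NULL) return -1;
  unsigned long target = (unsigned long)(uintptr_t)addr, start, end;
  char line[1024];
  int found = 0;
  while (!found && fgets(line, sizeof line, fp) != NULL) {
    char p[5];
    /* each line starts with: start-end perms ... */
    if (sscanf(line, "%lx-%lx %4s", &start, &end, p) != 3) continue;
    if (start <= target && target < end) {
      memcpy(perms, p, sizeof p);
      found = 1;
    }
  }
  fclose(fp);
  return found;
}

/* 1 if addr is mapped with the given flag ('r', 'w', 'x', 'p', 's'),
 * 0 if mapped without it, -1 if unmapped or on error. */
int addr_has_perm(const void *addr, char flag) {
  char perms[5];
  if (mapping_perms(addr, perms) != 1) return -1;
  return strchr(perms, flag) != NULL;
}
```

For instance, the address of a function should report 'x', while a string literal normally lives in a mapping without 'w'.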

Note vdso and vvar here; they come up again later.

[vdso] The virtual dynamically linked shared object.  See vdso(7).

Static linking

readelf

Program Headers:
  Type           Offset             VirtAddr           PhysAddr
                 FileSiz            MemSiz              Flags  Align
  LOAD           0x0000000000000000 0x0000000000400000 0x0000000000400000
                 0x0000000000000528 0x0000000000000528  R      0x1000
  LOAD           0x0000000000001000 0x0000000000401000 0x0000000000401000
                 0x000000000008bf1d 0x000000000008bf1d  R E    0x1000
  LOAD           0x000000000008d000 0x000000000048d000 0x000000000048d000
                 0x0000000000027315 0x0000000000027315  R      0x1000
  LOAD           0x00000000000b4908 0x00000000004b5908 0x00000000004b5908
                 0x00000000000059e8 0x00000000000072b8  RW     0x1000
  NOTE           0x0000000000000270 0x0000000000400270 0x0000000000400270
                 0x0000000000000030 0x0000000000000030  R      0x8
  NOTE           0x00000000000002a0 0x00000000004002a0 0x00000000004002a0
                 0x0000000000000044 0x0000000000000044  R      0x4
  TLS            0x00000000000b4908 0x00000000004b5908 0x00000000004b5908
                 0x0000000000000020 0x0000000000000060  R      0x8
  GNU_PROPERTY   0x0000000000000270 0x0000000000400270 0x0000000000400270
                 0x0000000000000030 0x0000000000000030  R      0x8
  GNU_STACK      0x0000000000000000 0x0000000000000000 0x0000000000000000
                 0x0000000000000000 0x0000000000000000  RW     0x10
  GNU_RELRO      0x00000000000b4908 0x00000000004b5908 0x00000000004b5908
                 0x00000000000036f8 0x00000000000036f8  R      0x1

There are now additional heap and data regions:

temp$ cat /proc/113873/maps
00400000-00401000 r--p 00000000 08:05 13811                              /home/vgalaxy/Templates/temp/a.out
00401000-0048d000 r-xp 00001000 08:05 13811                              /home/vgalaxy/Templates/temp/a.out
0048d000-004b5000 r--p 0008d000 08:05 13811                              /home/vgalaxy/Templates/temp/a.out
004b5000-004bc000 rw-p 000b4000 08:05 13811                              /home/vgalaxy/Templates/temp/a.out
004bc000-004bd000 rw-p 00000000 00:00 0                                  [heap]
7ffff7ff9000-7ffff7ffd000 r--p 00000000 00:00 0                          [vvar]
7ffff7ffd000-7ffff7fff000 r-xp 00000000 00:00 0                          [vdso]
7ffffffde000-7ffffffff000 rw-p 00000000 00:00 0                          [stack]
ffffffffff600000-ffffffffff601000 --xp 00000000 00:00 0                  [vsyscall]

Dynamic linking

readelf

Program Headers:
  Type           Offset             VirtAddr           PhysAddr
                 FileSiz            MemSiz              Flags  Align
  PHDR           0x0000000000000040 0x0000000000000040 0x0000000000000040
                 0x00000000000002d8 0x00000000000002d8  R      0x8
  INTERP         0x0000000000000318 0x0000000000000318 0x0000000000000318
                 0x000000000000001c 0x000000000000001c  R      0x1
      [Requesting program interpreter: /lib64/ld-linux-x86-64.so.2]
  LOAD           0x0000000000000000 0x0000000000000000 0x0000000000000000
                 0x00000000000005d8 0x00000000000005d8  R      0x1000
  LOAD           0x0000000000001000 0x0000000000001000 0x0000000000001000
                 0x00000000000001c5 0x00000000000001c5  R E    0x1000
  LOAD           0x0000000000002000 0x0000000000002000 0x0000000000002000
                 0x0000000000000130 0x0000000000000130  R      0x1000
  LOAD           0x0000000000002df0 0x0000000000003df0 0x0000000000003df0
                 0x0000000000000220 0x0000000000000228  RW     0x1000
  DYNAMIC        0x0000000000002e00 0x0000000000003e00 0x0000000000003e00
                 0x00000000000001c0 0x00000000000001c0  RW     0x8
  NOTE           0x0000000000000338 0x0000000000000338 0x0000000000000338
                 0x0000000000000030 0x0000000000000030  R      0x8
  NOTE           0x0000000000000368 0x0000000000000368 0x0000000000000368
                 0x0000000000000044 0x0000000000000044  R      0x4
  GNU_PROPERTY   0x0000000000000338 0x0000000000000338 0x0000000000000338
                 0x0000000000000030 0x0000000000000030  R      0x8
  GNU_EH_FRAME   0x0000000000002004 0x0000000000002004 0x0000000000002004
                 0x000000000000003c 0x000000000000003c  R      0x4
  GNU_STACK      0x0000000000000000 0x0000000000000000 0x0000000000000000
                 0x0000000000000000 0x0000000000000000  RW     0x10
  GNU_RELRO      0x0000000000002df0 0x0000000000003df0 0x0000000000003df0
                 0x0000000000000210 0x0000000000000210  R      0x1

There are now additional INTERP and DYNAMIC segments.

After starti:

temp$ cat /proc/141484/maps
555555554000-555555555000 r--p 00000000 08:05 13821                      /home/vgalaxy/Templates/temp/a.out
555555555000-555555556000 r-xp 00001000 08:05 13821                      /home/vgalaxy/Templates/temp/a.out
555555556000-555555557000 r--p 00002000 08:05 13821                      /home/vgalaxy/Templates/temp/a.out
555555557000-555555559000 rw-p 00002000 08:05 13821                      /home/vgalaxy/Templates/temp/a.out
7ffff7fc3000-7ffff7fc7000 r--p 00000000 00:00 0                          [vvar]
7ffff7fc7000-7ffff7fc9000 r-xp 00000000 00:00 0                          [vdso]
7ffff7fc9000-7ffff7fca000 r--p 00000000 08:01 137913                     /usr/lib/x86_64-linux-gnu/ld-2.33.so
7ffff7fca000-7ffff7ff1000 r-xp 00001000 08:01 137913                     /usr/lib/x86_64-linux-gnu/ld-2.33.so
7ffff7ff1000-7ffff7ffb000 r--p 00028000 08:01 137913                     /usr/lib/x86_64-linux-gnu/ld-2.33.so
7ffff7ffb000-7ffff7fff000 rw-p 00031000 08:01 137913                     /usr/lib/x86_64-linux-gnu/ld-2.33.so
7ffffffde000-7ffffffff000 rw-p 00000000 00:00 0                          [stack]
ffffffffff600000-ffffffffff601000 --xp 00000000 00:00 0                  [vsyscall]

At this point ld-2.33.so is mapped.

After running to main:

temp$ cat /proc/141484/maps
555555554000-555555555000 r--p 00000000 08:05 13821                      /home/vgalaxy/Templates/temp/a.out
555555555000-555555556000 r-xp 00001000 08:05 13821                      /home/vgalaxy/Templates/temp/a.out
555555556000-555555557000 r--p 00002000 08:05 13821                      /home/vgalaxy/Templates/temp/a.out
555555557000-555555558000 r--p 00002000 08:05 13821                      /home/vgalaxy/Templates/temp/a.out
555555558000-555555559000 rw-p 00003000 08:05 13821                      /home/vgalaxy/Templates/temp/a.out
7ffff7db9000-7ffff7dbb000 rw-p 00000000 00:00 0
7ffff7dbb000-7ffff7de1000 r--p 00000000 08:01 138129                     /usr/lib/x86_64-linux-gnu/libc-2.33.so
7ffff7de1000-7ffff7f4c000 r-xp 00026000 08:01 138129                     /usr/lib/x86_64-linux-gnu/libc-2.33.so
7ffff7f4c000-7ffff7f98000 r--p 00191000 08:01 138129                     /usr/lib/x86_64-linux-gnu/libc-2.33.so
7ffff7f98000-7ffff7f9b000 r--p 001dc000 08:01 138129                     /usr/lib/x86_64-linux-gnu/libc-2.33.so
7ffff7f9b000-7ffff7f9e000 rw-p 001df000 08:01 138129                     /usr/lib/x86_64-linux-gnu/libc-2.33.so
7ffff7f9e000-7ffff7fa9000 rw-p 00000000 00:00 0
7ffff7fc3000-7ffff7fc7000 r--p 00000000 00:00 0                          [vvar]
7ffff7fc7000-7ffff7fc9000 r-xp 00000000 00:00 0                          [vdso]
7ffff7fc9000-7ffff7fca000 r--p 00000000 08:01 137913                     /usr/lib/x86_64-linux-gnu/ld-2.33.so
7ffff7fca000-7ffff7ff1000 r-xp 00001000 08:01 137913                     /usr/lib/x86_64-linux-gnu/ld-2.33.so
7ffff7ff1000-7ffff7ffb000 r--p 00028000 08:01 137913                     /usr/lib/x86_64-linux-gnu/ld-2.33.so
7ffff7ffb000-7ffff7ffd000 r--p 00031000 08:01 137913                     /usr/lib/x86_64-linux-gnu/ld-2.33.so
7ffff7ffd000-7ffff7fff000 rw-p 00033000 08:01 137913                     /usr/lib/x86_64-linux-gnu/ld-2.33.so
7ffffffde000-7ffffffff000 rw-p 00000000 00:00 0                          [stack]
ffffffffff600000-ffffffffff601000 --xp 00000000 00:00 0                  [vsyscall]

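libc has now been loaded.

Some of the regions with an empty pathname are the bss segments of the main program and libc.

To see this, add char arr[1 << 30]; at the top of the main program: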

temp$ pmap 150159
150159:   /home/vgalaxy/Templates/temp/a.out
0000555555554000      4K r---- a.out
0000555555555000      4K r-x-- a.out
0000555555556000      4K r---- a.out
0000555555557000      4K r---- a.out
0000555555558000      4K rw--- a.out
0000555555559000 1048576K rw---   [ anon ]
00007ffff7db9000      8K rw---   [ anon ]
00007ffff7dbb000    152K r---- libc-2.33.so
00007ffff7de1000   1452K r-x-- libc-2.33.so
00007ffff7f4c000    304K r---- libc-2.33.so
00007ffff7f98000     12K r---- libc-2.33.so
00007ffff7f9b000     12K rw--- libc-2.33.so
00007ffff7f9e000     44K rw---   [ anon ]
00007ffff7fc3000     16K r----   [ anon ]
00007ffff7fc7000      8K r-x--   [ anon ]
00007ffff7fc9000      4K r---- ld-2.33.so
00007ffff7fca000    156K r-x-- ld-2.33.so
00007ffff7ff1000     40K r---- ld-2.33.so
00007ffff7ffb000      8K r---- ld-2.33.so
00007ffff7ffd000      8K rw--- ld-2.33.so
00007ffffffde000    132K rw---   [ stack ]
ffffffffff600000      4K --x--   [ anon ]
 total          1050956K

summary

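To summarize: the address space of a process consists of a number of contiguous segments.

Memory inside a segment can be accessed
An access outside every segment, or one that violates a segment's permissions, triggers SIGSEGV
gdb can bypass the permissions, but cannot access addresses that do not exist (the operating system left a back door; this is foreshadowing)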
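The permission rule can be tested safely by letting a child process take the fault. This is a minimal sketch (the function name is hypothetical): the child maps one anonymous page with the requested protection and writes to it, and the parent reports which signal, if any, killed the child.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <signal.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: returns the signal that killed a child which wrote to
 * a page mapped with `prot`, 0 if the write succeeded, -1 on error. */
int signal_on_write(int prot) {
  pid_t pid = fork();
  if (pid < 0) return -1;
  if (pid == 0) {
    volatile char *page =
        mmap(NULL, 4096, prot, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (page == MAP_FAILED) _exit(2);
    page[0] = 1;                       /* faults unless prot includes PROT_WRITE */
    _exit(0);
  }
  int status;
  if (waitpid(pid, &status, 0) != pid) return -1;
  if (WIFSIGNALED(status)) return WTERMSIG(status);
  return (WIFEXITED(status) && WEXITSTATUS(status) == 0) ? 0 : -1;
}
```

Writing to a PROT_READ page should report SIGSEGV, while the same write to a PROT_READ | PROT_WRITE page completes normally.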

vdso

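Read-only system calls may be able to run without trapping into the kernel.

Key idea → communicate with the kernel through shared memory.

An example: time

The kernel maintains the time at second granularity, and all processes map the same page.

Debugging with gdb

turns up the following instruction: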

0x7ffff7fc7901      lea    -0x4888(%rip),%r11        # 0x7ffff7fc3080

That memory location lies inside vvar:

7ffff7fc3000-7ffff7fc7000 r--p 00000000 00:00 0                          [vvar]
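The vdso region itself can be located from inside a process. The sketch below assumes glibc (or another libc providing `getauxval`) on Linux; the function name is hypothetical. It asks the auxiliary vector for AT_SYSINFO_EHDR, the address where the kernel mapped the vDSO, and checks that an ELF header sits there.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <string.h>
#include <sys/auxv.h>

/* Hypothetical helper: returns 1 if the kernel mapped a vDSO into this
 * process and it starts with the ELF magic, 0 otherwise. */
int vdso_is_elf_image(void) {
  unsigned long base = getauxval(AT_SYSINFO_EHDR);
  if (base == 0) return 0;             /* no vDSO reported for this process */
  return memcmp((const void *)base, "\177ELF", 4) == 0;
}
```

The address returned should match the start of the [vdso] line in /proc/self/maps, which is why the region shows up as r-xp: it is ordinary user-mode code supplied by the kernel.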

Implementation of system calls

The int instruction is too expensive, hence syscall.

SYSCALL — Fast System Call

RCX    <- RIP; (* address of the next instruction *)
RIP    <- IA32_LSTAR;
R11    <- RFLAGS;
RFLAGS <- RFLAGS & ~(IA32_FMASK);
CPL    <- 0; (* execute in Ring 0 *)
CS.Selector <- IA32_STAR[47:32] & 0xFFFC;
SS.Selector <- IA32_STAR[47:32] + 8;

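Managing a Process's Address Space

As observed earlier, libc can be loaded dynamically.

That requires the following APIs.

System calls that manage the process address space:

// map and unmap
void *mmap(void *addr, size_t length, int prot, int flags, int fd, off_t offset);
int munmap(void *addr, size_t length);

// change the permissions of a mapping
int mprotect(void *addr, size_t length, int prot);

Note mmap's fd and offset parameters.

They mean that a file can be mapped into the process address space.

So an ELF loader is very easy to implement with mmap:

parse out which parts need to be loaded into memory, then simply mmap them.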
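The first step of such a loader, mapping the file and looking at its header, can be sketched as follows. The function name `file_is_elf` is an assumption for illustration; it maps the whole file read-only and checks only the 4-byte ELF magic, nothing more.

```c
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical helper: returns 1 if the file begins with the ELF magic,
 * 0 if it does not, -1 on error (including files shorter than 4 bytes). */
int file_is_elf(const char *path) {
  int fd = open(path, O_RDONLY);
  if (fd < 0) return -1;
  struct stat st;
  if (fstat(fd, &st) < 0 || st.st_size < 4) {
    close(fd);
    return -1;
  }
  size_t len = (size_t)st.st_size;
  unsigned char *image = mmap(NULL, len, PROT_READ, MAP_PRIVATE, fd, 0);
  close(fd);                           /* the mapping outlives the descriptor */
  if (image == MAP_FAILED) return -1;
  int is_elf = memcmp(image, "\177ELF", 4) == 0;
  munmap(image, len);
  return is_elf;
}
```

A real loader would go on to read the program headers from the same mapping and mmap each LOAD segment at its VirtAddr with the listed permissions.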

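Two examples

mmap-alloc.c

Requests a huge amount of memory with mmap, and the call completes instantly.

Presumably the kernel only marks the region as mapped at this point.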

00007ffff7db9000      8K rw---   [ anon ]

The change is visible:

00007fff37db9000 3145736K rw---   [ anon ]

The pathname is empty because this is an anonymous mapping, one that is not backed by any file.
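A small probe makes the behavior of anonymous mappings concrete. This is a sketch, not the mmap-alloc.c from the lecture: the function name is hypothetical, and it checks that untouched pages read as zero and that the first and last bytes are writable.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

/* Hypothetical helper: map `len` bytes anonymously, check that they read as
 * zero and are writable at both ends. Returns 0 on success, -1 on failure. */
int probe_anonymous_mapping(size_t len) {
  assert(len >= 2);
  unsigned char *mem = mmap(NULL, len, PROT_READ | PROT_WRITE,
                            MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
  if (mem == MAP_FAILED) return -1;
  int ok = mem[0] == 0 && mem[len / 2] == 0;   /* fresh pages are zero-filled */
  mem[0] = 0xAA;                               /* only touched pages need memory */
  mem[len - 1] = 0x55;
  ok = ok && mem[0] == 0xAA && mem[len - 1] == 0x55;
  munmap(mem, len);
  return ok ? 0 : -1;
}
```

Passing a large length such as `(size_t)64 << 20` still returns quickly, consistent with the observation that mmap itself does not touch the memory.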

mmap-disk.py

#!/usr/bin/env python3

import mmap, hexdump

with open('/dev/sda', 'rb') as fp:
    mm = mmap.mmap(fp.fileno(), prot=mmap.PROT_READ, length=128 << 30)
    hexdump.hexdump(mm[:512])

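Maps the entire disk with mmap, and the call completes instantly.

Even after installing the module with pip, it still fails with: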

ModuleNotFoundError: No module named 'hexdump'

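In theory this dumps the first 512 bytes of the disk, i.e. the master boot sector.

Consistency between the file and memory → msync(2)

Address Space Isolation

Every *ptr can only access the memory of its own process (state machine)

Unless sharing is requested explicitly through mmap, by mapping a shared file or shared memory, or by using multiple threads
This implements the most important function of an operating system: isolation between processes

But we have gdb.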

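Game Trainers

金山游侠 (a game memory editor)

Think of it as a debugger for another process's memory:

find an important attribute in the process's memory and change it.

Key code: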

sprintf(buf, "/proc/%d/mem", pid);
fd = open(buf, O_RDWR);

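Get the game's process ID and access its memory as a file opened for reading and writing.

Then find the right memory location and modify it.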

2000 → 1700
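The same file interface works on the calling process, which gives a self-contained way to try it. In this sketch the function names are hypothetical and the target is /proc/self/mem rather than another process's file; patching a different process additionally requires permission to ptrace-attach to it.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stdint.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: overwrite an int in this process by writing to
 * /proc/self/mem at the variable's address. Returns 0 on success. */
int poke_own_memory(volatile int *target, int value) {
  int fd = open("/proc/self/mem", O_RDWR);
  if (fd < 0) return -1;
  /* the file offset is the virtual address to patch */
  ssize_t n = pwrite(fd, &value, sizeof value, (off_t)(uintptr_t)target);
  close(fd);
  return n == (ssize_t)sizeof value ? 0 : -1;
}

/* Mirrors the 2000 → 1700 example: returns the patched value, or -1. */
int demo_patch_gold(void) {
  volatile int gold = 2000;
  if (poke_own_memory(&gold, 1700) != 0) return -1;
  return gold;
}
```

A trainer does the same thing with the game's pid in the path, after first scanning the memory for the value it wants to change.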

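按键精灵 (input automation)

Sends keyboard/mouse events to a process.

Write a driver, or use an API provided by the operating system/window manager.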

https://github.com/jordansissel/xdotool

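变速齿轮 (Speed Gear)

In essence, it deceives the process about its clock.

Sources: alarms, sleep, gettimeofday

Code Injection

Wallhacks

Computer graphics

render(objects) -> render_hacked(objects)

Hot patching of software

Code can be injected statically, dynamically, through a vtable, through a DLL, ...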

https://zhuanlan.zhihu.com/p/425845057

Dynamic code injection is described below.

Dynamic Software Update, DSU

The key code is as follows:

struct jmp {
  uint32_t opcode : 8;
  int32_t offset : 32;
} __attribute__((packed));

#define JMP(off) ((struct jmp){0xe9, off - sizeof(struct jmp)})

static inline bool within_page(void *addr) {
  return (uintptr_t)addr % PG_SIZE + sizeof(struct jmp) <= PG_SIZE;
}

void DSU(void *old, void *new) {
  void *base = (void *)((uintptr_t)old & ~(PG_SIZE - 1));
  size_t len = PG_SIZE * (within_page(old) ? 1 : 2);
  int flags = PROT_WRITE | PROT_READ | PROT_EXEC;
  if (mprotect(base, len, flags) == 0) {
    *(struct jmp *)old = JMP((char *)new - (char *)old); // **PATCH**
    mprotect(base, len, flags & ~PROT_WRITE);
  } else {
    perror("DSU fail");
  }
}

Values of some variables:

old=0x5555555551a9 <foo>
new=0x5555555551d3 <foo_new>
base=0x555555555000 <_init>
len=4096
sizeof(struct jmp)=5

Before hooking:

00000000000011a9 <foo>:
    11a9:       f3 0f 1e fa             endbr64
    11ad:       48 83 ec 08             sub    $0x8,%rsp

After hooking:

(gdb) x/2 0x5555555551a9
0x5555555551a9 <foo>:   0x000025e9      0x08ec8300

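The instruction sequence at the start of foo has been changed to

e9 25 00 00 00

which jumps to 0x5555555551a9 + 0x25 + 0x5 = 0x5555555551d3,

that is, foo_new.

When the patch straddles a page boundary, the permissions of one more page must be changed.

This feels a bit like the attack lab.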

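Spear and Shield

Control-flow/data-flow integrity
- protect the integrity of a process
- protect private data from being read or written by other processes
AI monitoring/social engineering: if you are abnormally strong, of course you will be watched
Cloud/sandboxed rendering: computation no longer trusts the operating system

System Calls and the UNIX Shell

Shell

The kernel provides system calls.

The shell wraps the operating system API and provides a user interface.

The shell is a programming language that translates user commands into system calls.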

RTFM → man sh

Recreating a Classic

sh-xv6.c

Zero dependencies on library functions

Compiled with -ffreestanding

A freestanding environment is one in which the standard library may not exist, and program startup may not necessarily be at "main".

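No library functions are linked in

Execution starts at _start

Linked with ld

It can serve as the init program of the minimal Linux

Test it on the minimal Linux.

RTFSC

Tips before reading

Debug with gdb: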

set follow-fork-mode
set follow-exec-mode

Observe the system calls made by sh-xv6.c

Separate the strace output from the shell output:

$ strace -f -o strace.log ./sh-xv6
$ tail -f strace.log

strace.log grows as the shell runs.

To observe system calls more cleanly → pin the process to a CPU core

https://www.jianshu.com/p/f59d7df06432

cd is a builtin command; it does not use fork + execve.

For example

> cd ..

corresponds to

637254 read(0, "c", 1)                  = 1
637254 read(0, "d", 1)                  = 1
637254 read(0, " ", 1)                  = 1
637254 read(0, ".", 1)                  = 1
637254 read(0, ".", 1)                  = 1
637254 read(0, "\n", 1)                 = 1
637254 chdir("..")                      = 0

For other commands:

if (syscall(SYS_fork) == 0)
    runcmd(parsecmd(buf));

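parsecmd parses the command.

Note that during parsing, zalloc is used to allocate space for strings.

free is never called; the OS reclaims the memory automatically when the child process exits.

Then runcmd runs the command.

The syscall function used here is: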

// Minimum runtime library
long syscall(int num, ...) {
  va_list ap;
  va_start(ap, num);
  register long a0 asm ("rax") = num;
  register long a1 asm ("rdi") = va_arg(ap, long);
  register long a2 asm ("rsi") = va_arg(ap, long);
  register long a3 asm ("rdx") = va_arg(ap, long);
  register long a4 asm ("r10") = va_arg(ap, long);
  va_end(ap);
  asm volatile("syscall"
    : "+r"(a0) : "r"(a1), "r"(a2), "r"(a3), "r"(a4)
    : "memory", "rcx", "r8", "r9", "r11");
  return a0;
}

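The ABI can be looked up with man syscall.

Next, the runcmd function.

Note that because there are no environment variables, you need to cd /bin first to run coreutils conveniently.

Commands are classified into:

EXEC

A leaf node; uses the SYS_execve system call.

For example

> minimal

The system calls are: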

637254 read(0, "m", 1)                  = 1
637254 read(0, "i", 1)                  = 1
637254 read(0, "n", 1)                  = 1
637254 read(0, "i", 1)                  = 1
637254 read(0, "m", 1)                  = 1
637254 read(0, "a", 1)                  = 1
637254 read(0, "l", 1)                  = 1
637254 read(0, "\n", 1)                 = 1
637254 fork()                           = 637896
637254 wait4(-1,  <unfinished ...>
637896 execve("minimal", ["minimal"], NULL) = 0
637896 write(1, "\33[01;31mHello, OS World\33[0m\n", 28) = 28
637896 exit(1)                          = ?
637896 +++ exited with 1 +++
637254 <... wait4 resumed>NULL, 0, NULL) = 637896
637254 --- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=637896, si_uid=1000, si_status=1, si_utime=0, si_stime=0} ---

The classic fork + execve.

REDIR

Uses the SYS_open system call.

For example

> minimal > out

The system calls are:

709295 read(0, "m", 1)                  = 1
709295 read(0, "i", 1)                  = 1
709295 read(0, "n", 1)                  = 1
709295 read(0, "i", 1)                  = 1
709295 read(0, "m", 1)                  = 1
709295 read(0, "a", 1)                  = 1
709295 read(0, "l", 1)                  = 1
709295 read(0, " ", 1)                  = 1
709295 read(0, ">", 1)                  = 1
709295 read(0, " ", 1)                  = 1
709295 read(0, "o", 1)                  = 1
709295 read(0, "u", 1)                  = 1
709295 read(0, "t", 1)                  = 1
709295 read(0, "\n", 1)                 = 1
709295 fork()                           = 709631
709631 close(1 <unfinished ...>
709295 wait4(-1,  <unfinished ...>
709631 <... close resumed>)             = 0
709631 open("out", O_WRONLY|O_CREAT|O_TRUNC, 0644) = 1
709631 execve("minimal", ["minimal"], NULL) = 0
709631 write(1, "\33[01;31mHello, OS World\33[0m\n", 28) = 28
709631 exit(1)                          = ?
709631 +++ exited with 1 +++
709295 <... wait4 resumed>NULL, 0, NULL) = 709631
709295 --- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=709631, si_uid=1000, si_status=1, si_utime=0, si_stime=0} ---

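stdout is closed first, then the file out is opened. open returns the lowest available file descriptor, which is now 1, so stdout is effectively redirected to out.

Then the subcommand minimal is executed.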
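The lowest-descriptor rule that makes this redirection work can be checked in isolation. The sketch below uses a hypothetical function name and assumes that descriptor 0 (stdin) is open in the calling process, as it normally is; otherwise open would return 0 instead.

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: in a child, close stdout and open `path` for writing;
 * returns the descriptor number open() produced, or -1 on error. */
int fd_after_closing_stdout(const char *path) {
  pid_t pid = fork();
  if (pid < 0) return -1;
  if (pid == 0) {
    close(STDOUT_FILENO);              /* descriptor 1 becomes the lowest free one */
    int fd = open(path, O_WRONLY);
    _exit(fd < 0 ? 255 : fd);          /* report the number via the exit status */
  }
  int status;
  if (waitpid(pid, &status, 0) != pid || !WIFEXITED(status)) return -1;
  return WEXITSTATUS(status) == 255 ? -1 : WEXITSTATUS(status);
}
```

With "/dev/null" as the path the result should be 1, so anything the child later writes to stdout lands in the opened file, and this survives execve because descriptors are inherited.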

LIST

Executes commands sequentially.

For example

> minimal ; minimal

The system calls are:

709295 read(0, "m", 1)                  = 1
709295 read(0, "i", 1)                  = 1
709295 read(0, "n", 1)                  = 1
709295 read(0, "i", 1)                  = 1
709295 read(0, "m", 1)                  = 1
709295 read(0, "a", 1)                  = 1
709295 read(0, "l", 1)                  = 1
709295 read(0, " ", 1)                  = 1
709295 read(0, ";", 1)                  = 1
709295 read(0, " ", 1)                  = 1
709295 read(0, "m", 1)                  = 1
709295 read(0, "i", 1)                  = 1
709295 read(0, "n", 1)                  = 1
709295 read(0, "i", 1)                  = 1
709295 read(0, "m", 1)                  = 1
709295 read(0, "a", 1)                  = 1
709295 read(0, "l", 1)                  = 1
709295 read(0, "\n", 1)                 = 1
709295 fork()                           = 709716
709295 wait4(-1,  <unfinished ...>
709716 fork()                           = 709717
709717 execve("minimal", ["minimal"], NULL <unfinished ...>
709716 wait4(-1,  <unfinished ...>
709717 <... execve resumed>)            = 0
709717 write(1, "\33[01;31mHello, OS World\33[0m\n", 28) = 28
709717 exit(1)                          = ?
709717 +++ exited with 1 +++
709716 <... wait4 resumed>NULL, 0, NULL) = 709717
709716 --- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=709717, si_uid=1000, si_status=1, si_utime=0, si_stime=0} ---
709716 execve("minimal", ["minimal"], NULL) = 0
709716 write(1, "\33[01;31mHello, OS World\33[0m\n", 28) = 28
709716 exit(1)                          = ?
709716 +++ exited with 1 +++
709295 <... wait4 resumed>NULL, 0, NULL) = 709716
709295 --- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=709716, si_uid=1000, si_status=1, si_utime=0, si_stime=0} ---

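One extra fork is used to run the first subcommand minimal.

Recap:

cmd1; cmd2

cmd1 runs first; whether or not cmd1 fails, cmd2 runs next.

Use $? to inspect the exit status: 0 means no error, nonzero means an error.

cmd1 && cmd2

cmd2 runs only after cmd1 completes successfully.

cmd1 || cmd2

cmd2 runs only if cmd1 fails.

PIPE

Pipes.

For example

> minimal | /bin/wc -l

The system calls are: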

709295 read(0, "m", 1)                  = 1
709295 read(0, "i", 1)                  = 1
709295 read(0, "n", 1)                  = 1
709295 read(0, "i", 1)                  = 1
709295 read(0, "m", 1)                  = 1
709295 read(0, "a", 1)                  = 1
709295 read(0, "l", 1)                  = 1
709295 read(0, " ", 1)                  = 1
709295 read(0, "|", 1)                  = 1
709295 read(0, " ", 1)                  = 1
709295 read(0, "/", 1)                  = 1
709295 read(0, "b", 1)                  = 1
709295 read(0, "i", 1)                  = 1
709295 read(0, "n", 1)                  = 1
709295 read(0, "/", 1)                  = 1
709295 read(0, "w", 1)                  = 1
709295 read(0, "c", 1)                  = 1
709295 read(0, " ", 1)                  = 1
709295 read(0, "-", 1)                  = 1
709295 read(0, "l", 1)                  = 1
709295 read(0, "\n", 1)                 = 1
709295 fork()                           = 709884
709884 pipe( <unfinished ...>
709295 wait4(-1,  <unfinished ...>
709884 <... pipe resumed>[3, 4])        = 0
709884 fork( <unfinished ...>
709885 close(1 <unfinished ...>
709884 <... fork resumed>)              = 709885
709885 <... close resumed>)             = 0
709884 fork( <unfinished ...>
709885 dup(4 <unfinished ...>
709886 close(0 <unfinished ...>
709885 <... dup resumed>)               = 1
709884 <... fork resumed>)              = 709886
709885 close(3 <unfinished ...>
709884 close(3 <unfinished ...>
709886 <... close resumed>)             = 0
709885 <... close resumed>)             = 0
709884 <... close resumed>)             = 0
709886 dup(3 <unfinished ...>
709885 close(4 <unfinished ...>
709884 close(4 <unfinished ...>
709886 <... dup resumed>)               = 0
709886 close(3 <unfinished ...>
709885 <... close resumed>)             = 0
709884 <... close resumed>)             = 0
709886 <... close resumed>)             = 0
709885 execve("minimal", ["minimal"], NULL <unfinished ...>
709884 wait4(-1,  <unfinished ...>
709886 close(4)                         = 0
709885 <... execve resumed>)            = 0
709886 execve("/bin/wc", ["/bin/wc", "-l"], NULL <unfinished ...>
709885 write(1, "\33[01;31mHello, OS World\33[0m\n", 28) = 28
709885 exit(1 <unfinished ...>
709886 <... execve resumed>)            = 0
709885 <... exit resumed>)              = ?
709886 brk(NULL <unfinished ...>
709885 +++ exited with 1 +++
709884 <... wait4 resumed>NULL, 0, NULL) = 709885
709886 <... brk resumed>)               = 0x55a07da84000
709884 --- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=709885, si_uid=1000, si_status=1, si_utime=0, si_stime=0} ---
709886 arch_prctl(0x3001 /* ARCH_??? */, 0x7ffcff3f89a0) = -1 EINVAL (Invalid argument)
709884 wait4(-1,  <unfinished ...>
709886 access("/etc/ld.so.preload", R_OK) = -1 ENOENT (No such file or directory)
709886 openat(AT_FDCWD, "/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
709886 newfstatat(3, "", {st_mode=S_IFREG|0644, st_size=103526, ...}, AT_EMPTY_PATH) = 0
709886 mmap(NULL, 103526, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7fac0707a000
709886 close(3)                         = 0
709886 openat(AT_FDCWD, "/lib/x86_64-linux-gnu/libc.so.6", O_RDONLY|O_CLOEXEC) = 3
709886 read(3, "\177ELF\2\1\1\3\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\240\206\2\0\0\0\0\0"..., 832) = 832
709886 pread64(3, "\6\0\0\0\4\0\0\0@\0\0\0\0\0\0\0@\0\0\0\0\0\0\0@\0\0\0\0\0\0\0"..., 784, 64) = 784
709886 pread64(3, "\4\0\0\0 \0\0\0\5\0\0\0GNU\0\2\0\0\300\4\0\0\0\3\0\0\0\0\0\0\0"..., 48, 848) = 48
709886 pread64(3, "\4\0\0\0\24\0\0\0\3\0\0\0GNU\0+H)\227\201T\214\233\304R\352\306\3379\220%"..., 68, 896) = 68
709886 newfstatat(3, "", {st_mode=S_IFREG|0755, st_size=1983576, ...}, AT_EMPTY_PATH) = 0
709886 mmap(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fac07078000
709886 pread64(3, "\6\0\0\0\4\0\0\0@\0\0\0\0\0\0\0@\0\0\0\0\0\0\0@\0\0\0\0\0\0\0"..., 784, 64) = 784
709886 mmap(NULL, 2012056, PROT_READ, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7fac06e8c000
709886 mmap(0x7fac06eb2000, 1486848, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x26000) = 0x7fac06eb2000
709886 mmap(0x7fac0701d000, 311296, PROT_READ, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x191000) = 0x7fac0701d000
709886 mmap(0x7fac07069000, 24576, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x1dc000) = 0x7fac07069000
709886 mmap(0x7fac0706f000, 33688, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7fac0706f000
709886 close(3)                         = 0
709886 mmap(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fac06e8a000
709886 arch_prctl(ARCH_SET_FS, 0x7fac070795c0) = 0
709886 mprotect(0x7fac07069000, 12288, PROT_READ) = 0
709886 mprotect(0x55a07d4d3000, 4096, PROT_READ) = 0
709886 mprotect(0x7fac070c6000, 8192, PROT_READ) = 0
709886 munmap(0x7fac0707a000, 103526)   = 0
709886 brk(NULL)                        = 0x55a07da84000
709886 brk(0x55a07daa5000)              = 0x55a07daa5000
709886 fadvise64(0, 0, 0, POSIX_FADV_SEQUENTIAL) = -1 ESPIPE (Illegal seek)
709886 read(0, "\33[01;31mHello, OS World\33[0m\n", 16384) = 28
709886 read(0, "", 16384)               = 0
709886 newfstatat(1, "", {st_mode=S_IFCHR|0620, st_rdev=makedev(0x88, 0x2), ...}, AT_EMPTY_PATH) = 0
709886 write(1, "1\n", 2)               = 2
709886 close(0)                         = 0
709886 close(1)                         = 0
709886 close(2)                         = 0
709886 exit_group(0)                    = ?
709886 +++ exited with 0 +++
709884 <... wait4 resumed>NULL, 0, NULL) = 709886
709884 --- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=709886, si_uid=1000, si_status=0, si_utime=0, si_stime=0} ---
709884 exit(0)                          = ?
709884 +++ exited with 0 +++
709295 <... wait4 resumed>NULL, 0, NULL) = 709884
709295 --- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=709884, si_uid=1000, si_status=0, si_utime=0, si_stime=0} ---

The trace in between consists mostly of system calls made by /bin/wc, so it is easier to study the shell's source code directly

    case PIPE:
      pcmd = (struct pipecmd*)cmd;
      assert(syscall(SYS_pipe, p) >= 0);
      if (syscall(SYS_fork) == 0) {
        syscall(SYS_close, 1);
        syscall(SYS_dup, p[1]);
        syscall(SYS_close, p[0]);
        syscall(SYS_close, p[1]);
        runcmd(pcmd->left);
      }
      if (syscall(SYS_fork) == 0) {
        syscall(SYS_close, 0);
        syscall(SYS_dup, p[0]);
        syscall(SYS_close, p[0]);
        syscall(SYS_close, p[1]);
        runcmd(pcmd->right);
      }
      syscall(SYS_close, p[0]);
      syscall(SYS_close, p[1]);
      syscall(SYS_wait4, -1, 0, 0, 0);
      syscall(SYS_wait4, -1, 0, 0, 0);
      break;

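First, pipe creates a pipe: p[0] is the read end and p[1] is the write end

Then fork a child process that closes stdout and duplicates the write end onto stdout

dup returns the lowest available file descriptor, which is 1 right after close(1)

The child then closes both original pipe descriptors

Then fork a second child process that closes stdin and duplicates the read end onto stdin

Again dup returns the lowest available file descriptor, which is now 0

This child also closes both original pipe descriptors

Finally, the parent closes both ends of the pipe

The effect is that the stdout of the first command becomes the stdin of the second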

BACK

Creating a background process group

For example

> minimal &
> cd /bin

The resulting system calls

709295 read(0, "m", 1)                  = 1
709295 read(0, "i", 1)                  = 1
709295 read(0, "n", 1)                  = 1
709295 read(0, "i", 1)                  = 1
709295 read(0, "m", 1)                  = 1
709295 read(0, "a", 1)                  = 1
709295 read(0, "l", 1)                  = 1
709295 read(0, " ", 1)                  = 1
709295 read(0, "&", 1)                  = 1
709295 read(0, "\n", 1)                 = 1
709295 fork()                           = 710056
709295 wait4(-1,  <unfinished ...>
710056 fork()                           = 710057
710057 execve("minimal", ["minimal"], NULL <unfinished ...>
710056 exit(0)                          = ?
710056 +++ exited with 0 +++
709295 <... wait4 resumed>NULL, 0, NULL) = 710056
710057 <... execve resumed>)            = 0
709295 --- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=710056, si_uid=1000, si_status=0, si_utime=0, si_stime=0} ---
709295 write(2, "> ", 2 <unfinished ...>
710057 write(1, "\33[01;31mHello, OS World\33[0m\n", 28 <unfinished ...>
709295 <... write resumed>)             = 2
710057 <... write resumed>)             = 28
709295 read(0,  <unfinished ...>
710057 exit(1)                          = ?
710057 +++ exited with 1 +++
709295 <... read resumed>"c", 1)        = 1
709295 read(0, "d", 1)                  = 1
709295 read(0, " ", 1)                  = 1
709295 read(0, "/", 1)                  = 1
709295 read(0, "b", 1)                  = 1
709295 read(0, "i", 1)                  = 1
709295 read(0, "n", 1)                  = 1
709295 read(0, "\n", 1)                 = 1
709295 chdir("/bin")                    = 0

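Note where exit is called: the intermediate process 710056 exits as soon as it has forked 710057, so the shell's wait4 returns immediately and the prompt comes back while minimal is still running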

Traps and Pitfalls

Operator precedence

ls > a.txt | cat

Different shells respond differently, e.g. bash and zsh

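With text data, you are on your own

Everything is character-based

If a name contains spaces, you bear the consequences

Behavior is not always intuitive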

$ echo hello > /etc/a.txt
bash: /etc/a.txt: Permission denied
$ sudo echo hello > /etc/a.txt
bash: /etc/a.txt: Permission denied

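Recall the shell source code above

The shell tries to open /etc/a.txt for the redirection before it ever executes sudo ..., so the open runs with the shell's own privileges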

Terminals and job control

For job control, see Shell Lab

An example

minimal$ vi run.sh &
[1] 734130
minimal$ jobs
[1]+  Stopped                 vi run.sh

Then use fg %1 to bring Vim to the foreground

Pressing <C-z> stops it and turns it back into a background job

This relies on the signal mechanism

SIGINT

<C-c>

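An interrupt from the keyboard; can be caught

SIGQUIT

<C-\>

A quit request from the keyboard; can be caught

SIGTSTP

<C-z>

A stop signal from the terminal; can be caught

SIGTERM

Terminates the program; this is the default signal sent by the kill command; can be caught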

$ kill pid

SIGSTOP

A stop signal that does not come from the terminal; cannot be caught

SIGKILL

Kills the program; cannot be caught

$ kill -9 pid

Use kill -l to list all signals

For an example of catching a signal, see signal-handler.c

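Note that a child created by fork also receives the signal, because it is attached to the same terminal

More precisely

Ctrl-C is a signal sent by the terminal device to the foreground process group
Every forked process shares the same PGID by default, so all of them receive the signal

This is why the concepts of terminal, session, and process group are needed

RTFM → setpgid/getpgid(2)

It explains the relationship between process group, session, and controlling terminal

The terminal is a very special kind of device in UNIX operating systems

Each time a terminal is opened, a bash shell runs in the foreground by default

The shell then uses system calls to create foreground and background process groups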

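tmux actually emulates multiple terminals on a single screen

Running tty in each pane gives a different result

You can even write into another terminal through its device file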

$ vi /dev/pts/0

fork-printf can tell a tty from a pipe by means of a system call

Without a pipe, the trace contains one extra system call

newfstatat(1, "", {st_mode=S_IFCHR|0620, st_rdev=makedev(0x88, 0x1), ...}, AT_EMPTY_PATH) = 0

Implementing the C standard library

Freestanding

Available headers

https://en.cppreference.com/w/c/language/conformance

Some interesting things turned up

iso646.h

and/or/not

https://en.cppreference.com/w/c/language/operator_alternative

%:include <stdlib.h>
%:include <stdio.h>
%:include <iso646.h>

int main(int argc, char** argv)
??<
    if (argc > 1 and argv<:1:> not_eq NULL)
    <%
       printf("Hello%s\n", argv<:1:>);
    %>

    return EXIT_SUCCESS;
??>

This is also a C program (it uses trigraphs, digraphs, and the iso646.h macros)

$ gcc -trigraphs a.c

inttypes.h

printf format macros for fixed-width integer types

Encapsulation

memset

void *memset(void *s, int c, size_t n) {
  for (size_t i = 0; i < n; i++) {
    ((char *)s)[i] = c;
  }
  return s;
}

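Consider compiler optimizations

parallel101 → compiler optimization from the assembly point of view

Vectorization (SIMD)

Consider data races

The standard library is only responsible for the thread safety of its own internal data

For example, the buffer used by printf

printf

Variadic argument lists

stdarg.h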

File descriptors

FILE * wraps the system calls on a file descriptor

Observe with gdb

#include <stdio.h>

int main() {
  FILE * fp = fopen("a.txt", "w");
  fprintf(fp, "hello\n");
}

After fopen

$1 = {
  _flags = -72539004,
  _IO_read_ptr = 0x0,
  _IO_read_end = 0x0,
  _IO_read_base = 0x0,
  _IO_write_base = 0x0,
  _IO_write_ptr = 0x0,
  _IO_write_end = 0x0,
  _IO_buf_base = 0x0,
  _IO_buf_end = 0x0,
  _IO_save_base = 0x0,
  _IO_backup_base = 0x0,
  _IO_save_end = 0x0,
  _markers = 0x0,
  _chain = 0x7ffff7f9c5e0 <_IO_2_1_stderr_>,
  _fileno = 3,
  _flags2 = 0,
  _old_offset = 0,
  _cur_column = 0,
  _vtable_offset = 0 '\000',
  _shortbuf = "",
  _lock = 0x555555559380,
  _offset = -1,
  _codecvt = 0x0,
  _wide_data = 0x555555559390,
  _freeres_list = 0x0,
  _freeres_buf = 0x0,
  __pad5 = 0,
  _mode = 0,
  _unused2 = '\000' <repeats 19 times>
}

Note _fileno

After fprintf

$2 = {
  _flags = -72536956,
  _IO_read_ptr = 0x555555559480 "hello\n",
  _IO_read_end = 0x555555559480 "hello\n",
  _IO_read_base = 0x555555559480 "hello\n",
  _IO_write_base = 0x555555559480 "hello\n",
  _IO_write_ptr = 0x555555559486 "",
  _IO_write_end = 0x55555555a480 "",
  _IO_buf_base = 0x555555559480 "hello\n",
  _IO_buf_end = 0x55555555a480 "",
  _IO_save_base = 0x0,
  _IO_backup_base = 0x0,
  _IO_save_end = 0x0,
  _markers = 0x0,
  _chain = 0x7ffff7f9c5e0 <_IO_2_1_stderr_>,
  _fileno = 3,
  _flags2 = 0,
  _old_offset = 0,
  _cur_column = 0,
  _vtable_offset = 0 '\000',
  _shortbuf = "",
  _lock = 0x555555559380,
  _offset = -1,
  _codecvt = 0x0,
  _wide_data = 0x555555559390,
  _freeres_list = 0x0,
  _freeres_buf = 0x0,
  __pad5 = 0,
  _mode = -1,
  _unused2 = '\000' <repeats 19 times>
}

Note _IO_buf_base

popen and pclose

Wrappers around pipe

An API with a design flaw

FILE *popen(const char *command, const char *type);
int pclose(FILE *stream);

Since a pipe is by definition unidirectional, the type argument may specify only reading or writing, not both; the resulting stream is correspondingly read-only or write-only.

execve

In M3 only execve may be used, and it turns out to be inconvenient

For example, pathname is not looked up through the PATH environment variable

int execve(const char *pathname, char *const argv[], char *const envp[]);

In addition, man execve also explains the #! syntax

   Interpreter scripts
       An interpreter script is a text file that has execute permission enabled and whose first line is of the form:

           #!interpreter [optional-arg]

       The interpreter must be a valid pathname for an executable file.

       If the pathname argument of execve() specifies an interpreter script, then interpreter will be invoked with the following arguments:

           interpreter [optional-arg] pathname arg...

       where pathname is the absolute pathname of the file specified as the first argument of execve(), and arg... is the series of words pointed to by the argv argument of execve(), starting at argv[1]. Note that there is no way to get the argv[0] that was passed to the execve() call.

       For portable use, optional-arg should either be absent, or be specified as a single word (i.e., it should not contain white space); see NOTES below.

       Since Linux 2.6.28, the kernel permits the interpreter of a script to itself be a script. This permission is recursive, up to a limit of four recursions, so that the interpreter may be a script which is interpreted by a script, and so on.

We need friendlier APIs

man 3 exec

Example program

#include <stdio.h>
#include <unistd.h>

int main() {
  extern char **environ;
  char *argv[] = {"/bin/strace", "ls", ".", NULL};
  execve(argv[0], argv, environ);
  perror("execve"); // reached only if execve fails
}

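execv

envp is set automatically to the calling process's environ

#include <stdio.h>
#include <unistd.h>

int main() {
  char *argv[] = {"/bin/strace", "ls", ".", NULL};
  execv(argv[0], argv);
  perror("execv"); // reached only if execv fails
}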

This is not specific to execv; it simply follows from the missing e suffix

All other exec() functions (which do not include 'e' in the suffix) take the environment for the new process image from the external variable environ in the calling process.

With the e suffix, envp can be specified explicitly

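execvp

The file is looked up through the PATH environment variable, just as a shell does

#include <stdio.h>
#include <unistd.h>

int main() {
  char *argv[] = {"strace", "ls", ".", NULL};
  execvp(argv[0], argv);
  perror("execvp"); // reached only if execvp fails
}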

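execlp

A variadic argument list replaces argv

#include <stdio.h>
#include <unistd.h>

int main() {
  execlp("strace", "strace", "ls", ".", (char *)NULL);
  perror("execlp"); // reached only if execlp fails
}

Note the terminating NULL; man 3 exec requires it to be cast to (char *)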

system

man 3 system

A more convenient way

#include <stdlib.h>

int main() {
  char *arg = "strace ls .";
  system(arg);
}

Note that system returns to the caller, unlike the exec functions

error

man 3 err
man 3 error
man 3 errno

Example program

#include <err.h>
#include <error.h>
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>

int main() {
  char filename[] = "nonexist";
  FILE *fp = fopen(filename, "r");
  if (!fp) {
    // err(EXIT_FAILURE, "%s", filename);
    // error(EXIT_FAILURE, ENOENT, "%s", filename);
    warn("%s", filename);
  }
}

All three calls print

./a.out: nonexist: No such file or directory

Note that warn does not call exit(status)

List the errno values

$ errno -l

Also note that errno is thread-local; as evidence

#include <errno.h>

int main() {
  errno;
}

After preprocessing

# 3 "a.c"
int main() {

# 4 "a.c" 3 4
 (*__errno_location ())
# 4 "a.c"
      ;
}

https://stackoverflow.com/questions/1694164/is-errno-thread-safe

perror also does not call exit(status)

Its output has the form

msg: [error descriptions]

It is usually wrapped in a macro

#define handle_error(msg)                                                      \
  do {                                                                         \
    perror(msg);                                                               \
    exit(EXIT_FAILURE);                                                        \
  } while (0)

environ

man 7 environ

Implementing the env command

#include <stdio.h>

int main() {
  extern char **environ;
  for (char **env = environ; *env; env++) {
    printf("%s\n", *env);
  }
}

Observe how environ gets assigned

Static linking

Watchpoint 1: (char **)environ

Old value = (char **) 0x0
New value = (char **) 0x7fffffffdce8
0x0000000000402a23 in __libc_start_main ()

Dynamic linking

Not observable right after starti

Program stopped.
0x00007ffff7fca0d0 in ?? () from /lib64/ld-linux-x86-64.so.2
(gdb) p environ
No symbol "environ" in current context.

After reaching _start

Breakpoint 1, 0x0000555555555060 in _start ()
(gdb) p environ
$1 = (char **) 0x0

Setting a watchpoint does not help; the value stays (char **) 0x0

The suspicion is that libc has no debug information

https://stackoverflow.com/questions/10000335/how-to-use-debug-version-of-libc

(gdb) info variables
...
Non-debugging symbols:
...
0x0000555555558020  __environ@GLIBC_2.2.5
0x0000555555558020  environ@GLIBC_2.2.5
...

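malloc and free

L1 lab guide

Optimizing without a workload in mind is irresponsible

https://www.microsoft.com/en-us/research/uploads/prod/2019/06/mimalloc-tr-v1.pdf

Set up two systems:

fast path
- Extremely fast, highly parallel, covers most cases
- Fails with a small probability → fall back to slow path
- Segregated List (Slab)
slow path
- Need not be that fast, but handles the hard cases properly
- Computer systems have many examples of this, such as caches
- Buddy system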

RTFM

https://www.gnu.org/software/libc/manual/

https://sourceware.org/newlib/

A fork() in the Road

Additional notes on the behavior of fork()

offset

After fork, parent and child share the offset of inherited file descriptors

#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

int main() {
  int fd = open("a.txt", O_WRONLY | O_CREAT, 0644); // O_CREAT requires a mode argument
  assert(fd >= 0);                                  // 0 is also a valid descriptor
  pid_t pid = fork();
  assert(pid >= 0);
  if (pid == 0) {
    write(fd, "Hello", 5);
  } else {
    write(fd, "World", 5);
  }
  close(fd);
}

RTFM: write(2), BUGS section

Two file descriptors related by dup also share one offset

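copy-on-write

Conceptually the state machine is copied, but in practice the memory is shared after the copy

After the copy, the whole address space is marked read-only

When a page is written, the operating system catches the page fault and copies the page if needed

The operating system maintains a reference count for every page

Proof -> cow-test.c

Corollary -> measuring the memory used by a single process is an ill-posed question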

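State machines, fork(), and magic

Parallelizing search

Speed up state-space search

Fork a new process for every exploration step

No backtracking is needed; just exit

Skipping initialization

Initialization is expensive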

int main() {
  nemu_init(); // only once
  while (1) {
    file = get_start_request();
    if ((pid = fork()) == 0) {
      // bad practice: no error checking
      load_file();
    }
    ...

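This effectively backs up the initialized state

Real-world uses

Zygote Process (Android)
- Initializing the Java Virtual Machine involves loading a large number of classes
- Load once, use everywhere
  - System resources used by apps
  - Base class libraries
  - libc
  - …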
Chrome site isolation (Chrome)
Fork server (AFL)

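Backup and fault tolerance

Virtual machine snapshots are wonderful

If the main process crashes, restart execution from the snapshot

Some bugs disappear after a small change in the environment, e.g. concurrency bugs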

POSIX Spawn

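If there were only memory and file descriptors, fork + execve would be a very elegant design

But there is more

Signals

Signal handlers, maintained by the operating system

Threads

Linux provides the clone system call for threads

Inter-process communication objects
……

So the design of fork keeps getting more complicated

This paper https://www.microsoft.com/en-us/research/uploads/prod/2019/04/fork-hotos19.pdf lists the sins of fork

There is a newly designed API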

int posix_spawn(pid_t *pid, char *path,
  posix_spawn_file_actions_t *file_actions,
  posix_spawnattr_t *attrp,
  char * argv[], char * envp[]);

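Parameters

pid: the returned process ID
path: the program, i.e. the state machine to reset to
file_actions: open, close, dup
attrp: signals, process group, and similar attributes
argv, envp: same as execve

The manual https://man7.org/linux/man-pages/man3/posix_spawn.3.html gives an example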

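Executable files

Only static linking is considered for now

RTFM

http://jyywiki.cn/pages/OS/manuals/sysv-abi.pdf

https://refspecs.linuxbase.org/

Description of a state machine

A data structure that describes the initial state and the transitions of a state machine

The classic $(M,R)$ model

Executable files on an operating system

execve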

$ file a.out
a.out: ELF 64-bit LSB pie executable, x86-64, version 1 (SYSV), dynamically linked, interpreter /lib64/ld-linux-x86-64.so.2, BuildID[sha1]=17d8088a8948dd3853fb50eaabbd1ed2e3bdc7b2, for GNU/Linux 4.4.0, not stripped
$ file a.c
a.c: C source, ASCII text

hack a.c

$ chmod u+x a.c
$ strace ./a.c
execve("./a.c", ["./a.c"], 0x7fffe4328600 /* 87 vars */) = -1 ENOEXEC (Exec format error)
strace: exec: Exec format error

hack a.out

$ chmod u-x a.out
$ strace ./a.out
execve("./a.out", ["./a.out"], 0x7ffe4ef3c130 /* 87 vars */) = -1 EACCES (Permission denied)
strace: exec: Permission denied

RTFM -> execve ERRORS

She-bang

Starts with #!

File: a.c

#include <stdio.h>

int main(int argc, char *argv[]) {
  for (int i = 0; i < argc; ++i) {
    printf("argv[%d] -> %s\n", i, argv[i]);
  }
}

File: demo

#!././a.out Hello World

Run it

$ ./demo 1 2 3
argv[0] -> ././a.out
argv[1] -> Hello World
argv[2] -> ./demo
argv[3] -> 1
argv[4] -> 2
argv[5] -> 3

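Note the order of the arguments

Hello World is treated as a single argument

Parsing executable files

GNU binutils

Generating executables
- ld (linker), as (assembler)
- ar, ranlib
Analyzing executables - static
- objcopy/objdump/readelf
- addr2line, size, nm
Analyzing executables - dynamic
- gdb

Debug information

Compiler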

$C = \text{Compile}(S)$

Debug information implements an imperfect inverse transformation (imperfect because of compiler optimizations)

$S = \text{DebugInfo}(C)$

The DWARF Debugging Standard

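It defines a Turing-complete instruction set, DW_OP_XXX
which can perform arbitrary computation to map the current machine state back to C

Observation

Assembly

Use the -g -S options

.Ldebug_info0:
...

Object file

Use the -g -c options

$ readelf -w foo.o

Executable file

Use the -g option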

$ addr2line 401645
/home/vgalaxy/Desktop/virtual-machine-repository/code/virtualization/popcount.c:4

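where 401645 is the start address of the popcount function

Stack unwinding information

Consider the stack frame layout of x86 function calls

Almost every function starts with

push   %rbp
mov    %rsp,%rbp

Taking the call itself into account, call f behaves like

push   retaddr
jmp    f

so the stack contains the following relationship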

┌─────────────┐
│     ret     │
├─────────────┤
│     rbp     │◄───────┐
├─────────────┤        │
│             │        │
│             │        │
│             │        │
│             │        │
│             │        │
│             │        │
├─────────────┤        │
│     ret     │        │
├─────────────┤        │
│     rbp     ├────────┘
├─────────────┤
│             │
│             │
└─────────────┘

So rbp can be used to walk the stack

struct frame {
  struct frame *next; // push %rbp
  void *addr;         // call f (pushed retaddr)
};

void backtrace() {
  struct frame *f;
  char cmd[1024];
  extern char end;

  asm volatile ("movq %%rbp, %0" : "=g"(f));
  for (; f->addr < (void *)&end; f = f->next) {
    printf("%016lx  ", (long)f->addr); fflush(stdout);
    sprintf(cmd, "addr2line -e %s %p", binary, f->addr);
    system(cmd);
  }
}

The result is rather crude, and the program can also segfault

0000000000401729  /home/vgalaxy/Desktop/virtual-machine-repository/code/virtualization/unwind.c:26
000000000040173a  /home/vgalaxy/Desktop/virtual-machine-repository/code/virtualization/unwind.c:30
0000000000401764  /home/vgalaxy/Desktop/virtual-machine-repository/code/virtualization/unwind.c:34
0000000000401aea  libc-start.o:?

The -fno-omit-frame-pointer option is required

Reverse engineering

https://hex-rays.com/

Debug information is out of the question
There is not even a symbol table

stripped

$ strip ./a.out
$ nm ./a.out
nm: ./a.out: no symbols

It looks like nothing more than a sequence of instructions

Relocation

Relative addresses

// main.c
void hello();

int main() {
  hello();
}

// hello.c
#include <stdio.h>

void hello() {
  ...
}

Disassembly of main.o

0000000000000000 <main>:
   0:   48 83 ec 08             sub    $0x8,%rsp
   4:   31 c0                   xor    %eax,%eax
   6:   e8 00 00 00 00          call   b <main+0xb>
   b:   31 c0                   xor    %eax,%eax
   d:   48 83 c4 08             add    $0x8,%rsp
  11:   c3                      ret

Relocation must satisfy an assertion

assert(
  (char *)hello ==
    (char *)main + 0xb + // next PC after call hello
    *(int32_t *)((uintptr_t)main + 0x7) // the offset field inside the call instruction
);

It can be written explicitly into hello.c

  char *p = (char *)main + 0x6 + 1;
  int32_t offset = *(int32_t *)p;
  assert((char *)main + 0xb + offset == (char *)hello);

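This is tightly coupled to the instruction encoding

So a relocatable object file is a container for part of a state machine

The assertion is stored in the ELF file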

  Offset            Type      Sym. Name + Addend
000000000007  R_X86_64_PLT32  hello - 4

The 32-bit value is refilled with S + A - P

S = hello
A = -4
P = main + 0x7

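The whole compilation toolchain

Compiler (gcc)

High-level semantics (the C state machine) → low-level semantics (assembly)

Assembler (as)

Low-level semantics → binary semantics (state machine container)
- Translates one-to-one into binary code
  - sections, symbols, debug info
- Whatever cannot be decided yet leaves behind information on what to do later
  - relocations

Linker (ld)

Merges all containers into one complete state machine
- ldscript (-Wl,--verbose); links with the C Runtime Objects (CRT)
- missing/duplicate symbols cause errors

ELF details

ELF is just a container data structure that holds the necessary information

You could perfectly well try to define your own binary file format

and then converge to ELF and understand the manual

Loading executable files

Static ELF loader

Parsing the data structures

/usr/include/elf.h

The structs are already defined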

Copying into memory

Use mmap to map the segments into the loader's address space

Here is the figure from PA

      +-------+---------------+-----------------------+
      |       |...............|                       |
      |       |...............|                       |  ELF file
      |       |...............|                       |
      +-------+---------------+-----------------------+
      0       ^               |
              |<------+------>|
              |       |       |
              |       |
              |       +----------------------------+
              |                                    |
   Type       |   Offset    VirtAddr    PhysAddr   |FileSiz  MemSiz   Flg  Align
   LOAD       +-- 0x001000  0x03000000  0x03000000 +0x1d600  0x27240  RWE  0x1000
                               |                       |       |
                               |   +-------------------+       |
                               |   |                           |
                               |   |     |           |         |
                               |   |     |           |         |
                               |   |     +-----------+ ---     |
                               |   |     |00000000000|  ^      |
                               |   | --- |00000000000|  |      |
                               |   |  ^  |...........|  |      |
                               |   |  |  |...........|  +------+
                               |   +--+  |...........|  |
                               |      |  |...........|  |
                               |      v  |...........|  v
                               +-------> +-----------+ ---
                                         |           |
                                         |           |
                                            Memory

Creating the initial runtime state of the process

System V ABI Figure 3.9 Initial Process Stack

A macro trick

#define push(sp, T, ...) ({ *((T*)sp) = (T)__VA_ARGS__; sp = (void *)((uintptr_t)(sp) + sizeof(T)); })

Then

// argc
while (argv[argc]) argc++;
push(sp, intptr_t, argc);
// argv[], NULL-terminate
for (int i = 0; i <= argc; i++)
  push(sp, intptr_t, argv[i]);

which expands to

while (argv[argc]) argc++;
({ *((intptr_t*)sp) = (intptr_t)argc;
  sp = (void *)((uintptr_t)(sp) + sizeof(intptr_t));
 });
for (int i = 0; i <= argc; i++)
  ({ *((intptr_t*)sp) = (intptr_t)argv[i];
    sp = (void *)((uintptr_t)(sp) + sizeof(intptr_t));
   });

Jump

  asm volatile(
    "mov $0, %%rdx;" // required by ABI
    "mov %0, %%rsp;"
    "jmp *%1" : : "a"(sp_exec), "b"(h->e_entry));

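In effect, we have implemented the execve system call using the read + mmap + close system calls

That is, the OS only needs to load one program, the loader; the loader can load all the others

In other words, every other program starts out running the loader's code

This moves the loading code from kernel space to user space

Boot Block Loader

abstract-machine/am/src/x86/qemu/boot/main.c

See "The state machine model of operating systems"

The mechanism is exactly the same, except that mmap is replaced by explicit manipulation of disk and memory

Linux Kernel ELF Loader

After decompressing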

$ make menuconfig
$ make bzImage -j8
...
Kernel: arch/x86/boot/bzImage is ready  (#1)

The benefit of ccache becomes obvious immediately

Then replace vmlinuz in linux-minimal with bzImage

qemu-system-x86_64: Error loading uncompressed kernel without PVH ELF Note

Disabling Networking support in the configuration seems to fix it

Use modern tooling: vscode + compile_commands

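Dynamic linking and loading

Core idea -> table lookup

Design a new binary file format `.dl`

Equip .dl files with a full toolchain at minimal cost

Generation: start with nothing and borrow everything
- Assume the compiler can generate position-independent code (PIC)
- as = GNU as
- ld = objcopy
Analysis: write our own
- readdl (readelf)
- objdump
Loading: write our own
- loader

Example main.S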

#include "dl.h"

DL_HEAD

LOAD("libc.dl")
LOAD("libhello.dl")
IMPORT(hello)
EXPORT(main)

DL_CODE

main:
  call DSYM(hello)
  call DSYM(hello)
  call DSYM(hello)
  call DSYM(hello)
  movq $0, %rax
  ret

DL_END

Output of gcc -E

__hdr:
  .ascii "\x01\x14\x05\x14";
  .4byte (__end - __hdr);
  .4byte (__code - __hdr)

.align 32, 0;
  .8byte (0);
  .ascii "+" "libc.dl" "\0"

.align 32, 0;
  .8byte (0);
  .ascii "+" "libhello.dl" "\0"

.align 32, 0;
hello:
  .8byte (0);
  .ascii "?" "hello" "\0"

.align 32, 0;
  .8byte (main - __hdr);
  .ascii "#" "main" "\0"

.fill 32 - 1, 1, 0;

.align 32, 0;
__code:

main:
  call *hello(%rip)
  call *hello(%rip)
  call *hello(%rip)
  call *hello(%rip)
  movq $0, %rax
  ret

__end:

as + objcopy -> main.dl

00000000: 0114 0514 e000 0000 c000 0000 0000 0000  ................
00000010: 0000 0000 0000 0000 0000 0000 0000 0000  ................
00000020: 0000 0000 0000 0000 2b6c 6962 632e 646c  ........+libc.dl
00000030: 0000 0000 0000 0000 0000 0000 0000 0000  ................
00000040: 0000 0000 0000 0000 2b6c 6962 6865 6c6c  ........+libhell
00000050: 6f2e 646c 0000 0000 0000 0000 0000 0000  o.dl............
00000060: 0000 0000 0000 0000 3f68 656c 6c6f 0000  ........?hello..
00000070: 0000 0000 0000 0000 0000 0000 0000 0000  ................
00000080: c000 0000 0000 0000 236d 6169 6e00 0000  ........#main...
00000090: 0000 0000 0000 0000 0000 0000 0000 0000  ................
000000a0: 0000 0000 0000 0000 0000 0000 0000 0000  ................
000000b0: 0000 0000 0000 0000 0000 0000 0000 0000  ................
000000c0: ff15 9aff ffff ff15 94ff ffff ff15 8eff  ................
000000d0: ffff ff15 88ff ffff 48c7 c000 0000 00c3  ........H.......
000000e0: 0a                                       .

The file format is as follows

┌─────────────────┐
│      magic      │
├─────────────────┤
│    file size    │
├─────────────────┤
│   code offset   │
├────────┬───┬────┤
│  addr  │tag│    │
├────────┴───┘    │
│   symbol name   │
├─────────────────┤
│                 │
│      .....      │
│                 │
├─────────────────┤
│                 │
│     0000000     │
│                 │
├─────────────────┤
│       code      │
└─────────────────┘

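Now analyze dlbox.c

Generation

as + objcopy

Only the code is copied to the code position of the binary file

Analysis

readdl -> after parsing the file, walk the symbol table symtab and print each entry according to its type

objdump -> after parsing the file, disassemble the code section with ndisasm

Loading

After parsing the file, walk the symbol table to find main, then jump to it

So the key is parsing the binary file, i.e. the implementation of dlopen

Walk the symbol table symtab and handle each entry according to its type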

  for (struct symbol *sym = h->symtab; sym->type; sym++) {
    switch (sym->type) {
      case '+': dlload(sym); break; // (recursively) load
      case '?': sym->offset = (uintptr_t)dlsym(sym->name); break; // resolve
      case '#': dlexport(sym->name, (char *)h + sym->offset); break; // export
    }
  }

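dlload recursively loads libraries into the global libs
dlexport exports a symbol to the global symbol table syms
dlsym looks up the global syms to refill the address with the correct offset

For example

main.dl loads libc.dl
libc.dl exports the symbols putchar and exit
main.dl then loads libhello.dl
libhello.dl imports putchar, refills the correct offset at the corresponding location, and exports the symbol hello
main.dl imports the symbol hello, refills its address, and exports the symbol main as the entry point

Design flaws of the dl file format

Memory protection and load address

Allow part of a .dl file to be mapped at a specified memory location with specified permissions
program header table

Allow the loader to be chosen freely instead of always dlbox

Add INTERP

Wasted space

Store strings in a constant pool and access them uniformly through pointers

DSYM is an indirect memory access

extern void foo();
foo();

One way of writing it, two cases

From another compilation unit
- A direct PC-relative jump is enough
- Otherwise performance is too low
From a dynamically linked library
- A table lookup is required

To unify the two cases and improve performance, the Procedure Linkage Table (PLT) was born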

putchar@PLT:
  call DSYM(putchar)

foo@PLT:
  call foo

main:
  call putchar@PLT
  call foo@PLT

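The symbol table described above is the Global Offset Table (GOT)

Data

The trouble with stdout/errno/environ
Many libraries use them, but there should be only one copy
They are treated specially, as in the example below

For the program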

#include <stdio.h>

int main() {
  fprintf(stdout, "Hello\n");
}

After dynamic linking, inspect the .rela.dyn section of the ELF file

Relocation section '.rela.dyn' at offset 0x580 contains 9 entries:
  Offset          Info           Type           Sym. Value    Sym. Name + Addend
000000003de8  000000000008 R_X86_64_RELATIVE                    1130
000000003df0  000000000008 R_X86_64_RELATIVE                    10e0
000000004028  000000000008 R_X86_64_RELATIVE                    4028
000000003fd8  000100000006 R_X86_64_GLOB_DAT 0000000000000000 __libc_start_main@GLIBC_2.34 + 0
000000003fe0  000200000006 R_X86_64_GLOB_DAT 0000000000000000 _ITM_deregisterTM[...] + 0
000000003fe8  000300000006 R_X86_64_GLOB_DAT 0000000000000000 __gmon_start__ + 0
000000003ff0  000500000006 R_X86_64_GLOB_DAT 0000000000000000 _ITM_registerTMCl[...] + 0
000000003ff8  000600000006 R_X86_64_GLOB_DAT 0000000000000000 __cxa_finalize@GLIBC_2.2.5 + 0
000000004030  000700000005 R_X86_64_COPY     0000000000004030 stdout@GLIBC_2.2.5 + 0

In addition, to keep errno thread-safe

a macro replaces errno with a call to the function __errno_location, which shows up in the .rela.plt section

Relocation section '.rela.plt' at offset 0x620 contains 1 entry:
  Offset          Info           Type           Sym. Value    Sym. Name + Addend
000000004018  000200000007 R_X86_64_JUMP_SLO 0000000000000000 __errno_location@GLIBC_2.2.5 + 0

See https://blog.csdn.net/lidan113lidan/article/details/119901186

xv6 code walkthrough

RTFM

xv6: A simple, Unix-like teaching operating system

RTFSC

https://github.com/mit-pdos/xv6-riscv

Hands-on

References

https://pdos.csail.mit.edu/6.828/2020/xv6.html

https://pdos.csail.mit.edu/6.828/2020/tools.html

Installation

$ sudo pacman -S riscv64-linux-gnu-binutils riscv64-linux-gnu-gcc riscv64-linux-gnu-gdb qemu-arch-extra

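Then

$ make qemu

and that is all

Some details

Single core

To make debugging easier, change CPUS to 1 in the Makefile

Some qemu shortcuts

Switch to the monitor console <C-a-c>

Quit <C-a-x>

These clash with tmux

Configuring vscode

Use bear to generate compile_commands.json

$ bear -- make

Then open vscode, which offers to generate .vscode/c_cpp_properties.json

The same applies to the OS labs, except that compile_commands.json is generated inside the kernel folder and vscode is then opened at the root directory

After this, neither vscode nor vim reports errors

Debugging with gdb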

$ make qemu-gdb
$ riscv64-linux-gnu-gdb

Debugging with vscode

$ make qemu-gdb

References

https://www.cnblogs.com/KatyuMarisaBlog/p/13727565.html

https://code.visualstudio.com/docs/cpp/launch-json-reference

Create .vscode/launch.json

{
    // Use IntelliSense to learn about possible attributes.
    // Hover to view descriptions of existing attributes.
    // For more information, visit: https://go.microsoft.com/fwlink/?linkid=830387
    "version": "0.2.0",
    "configurations": [
        {
            "name": "debug xv6",
            "type": "cppdbg",
            "request": "launch",
            "stopAtEntry": true,
            "program": "${workspaceFolder}/kernel/kernel",
            "cwd": "${workspaceFolder}",
            "miDebuggerServerAddress": "localhost:26000",
            "miDebuggerPath": "/usr/bin/riscv64-linux-gnu-gdb",
            "environment": [],
            "externalConsole": false,
            "MIMode": "gdb",
            "setupCommands": [
                {
                    "description": "pretty printing",
                    "text": "-enable-pretty-printing",
                    "ignoreFailures": true
                }
            ],
            "logging": {
                // "engineLogging": true,
                // "programOutput": true
            }
        }
    ]
}

Note that target remote 127.0.0.1:26000 needs to be commented out

Build process

Kernel code

riscv64-linux-gnu-gcc    -c -o kernel/entry.o kernel/entry.S
riscv64-linux-gnu-gcc -Wall -Werror -O -fno-omit-frame-pointer -ggdb -MD -mcmodel=medany -ffreestanding -fno-common -nostdlib -mno-relax -I. -fno-stack-protector -fno-pie -no-pie   -c -o kernel/start.o kernel/start.c
...
riscv64-linux-gnu-ld -z max-page-size=4096 -T kernel/kernel.ld -o kernel/kernel kernel/entry.o kernel/start.o kernel/console.o kernel/printf.o kernel/uart.o kernel/kalloc.o kernel/spinlock.o kernel/string.o kernel/main.o kernel/vm.o kernel/proc.o kernel/swtch.o kernel/trampoline.o kernel/trap.o kernel/syscall.o kernel/sysproc.o kernel/bio.o kernel/fs.o kernel/log.o kernel/sleeplock.o kernel/file.o kernel/pipe.o kernel/exec.o kernel/sysfile.o kernel/kernelvec.o kernel/plic.o kernel/virtio_disk.o
riscv64-linux-gnu-objdump -S kernel/kernel > kernel/kernel.asm
riscv64-linux-gnu-objdump -t kernel/kernel | sed '1,/SYMBOL TABLE/d; s/ .* / /; /^$/d' > kernel/kernel.sym

initcode

initcode.o -> initcode.out -> initcode -> initcode.asm

riscv64-linux-gnu-gcc -Wall -Werror -O -fno-omit-frame-pointer -ggdb -MD -mcmodel=medany -ffreestanding -fno-common -nostdlib -mno-relax -I. -fno-stack-protector -fno-pie -no-pie -march=rv64g -nostdinc -I. -Ikernel -c user/initcode.S -o user/initcode.o
riscv64-linux-gnu-ld -z max-page-size=4096 -N -e start -Ttext 0 -o user/initcode.out user/initcode.o
riscv64-linux-gnu-objcopy -S -O binary user/initcode.out user/initcode
riscv64-linux-gnu-objdump -S user/initcode.o > user/initcode.asm

User code

riscv64-linux-gnu-gcc -Wall -Werror -O -fno-omit-frame-pointer -ggdb -MD -mcmodel=medany -ffreestanding -fno-common -nostdlib -mno-relax -I. -fno-stack-protector -fno-pie -no-pie   -c -o user/sh.o user/sh.c
riscv64-linux-gnu-ld -z max-page-size=4096 -N -e main -Ttext 0 -o user/_sh user/sh.o user/ulib.o user/usys.o user/printf.o user/umalloc.o
riscv64-linux-gnu-objdump -S user/_sh > user/sh.asm
riscv64-linux-gnu-objdump -t user/_sh | sed '1,/SYMBOL TABLE/d; s/ .* / /; /^$/d' > user/sh.sym

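For each user program there is a set of related files

xxx.c source code

xxx.o relocatable object file

xxx.asm disassembly

xxx.sym symbol table

xxx.d header dependency information, written by gcc because of the -MD flag in the compile commands above

_xxx the program as it actually appears in the file system

File system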

gcc -Werror -Wall -I. -o mkfs/mkfs mkfs/mkfs.c
...
mkfs/mkfs fs.img README user/_cat user/_echo user/_forktest user/_grep user/_init user/_kill user/_ln user/_ls user/_mkdir user/_rm user/_sh user/_stressfs user/_usertests user/_grind user/_wc user/_zombie

Reference

https://hitsz-lab.gitee.io/os-labs-2021/remote_env_gdb/

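Context switching

Processes in xv6

Static view

gcc/ld create the code and data; see the ldscript

Dynamic view

Registers

General-purpose registers + $pc

Memory

The address space configured by $satp

In QEMU, use info mem to inspect it

Operating system objects held by the process

Not visible

File descriptors

Also, pressing <C-p> in xv6 prints the process list

1 sleep  init
2 sleep  sh

This is not a qemu feature: the xv6 console interrupt handler calls procdump() on <C-p>

Debugging kernel code

The main function in kernel/main.c calls the userinit function in kernel/proc.c

which allocates a page and copies the contents of initcode into it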

   0:   00000517                auipc   a0,0x0
   4:   02450513                addi    a0,a0,36 # 24 <init>
   8:   00000597                auipc   a1,0x0
   c:   02358593                addi    a1,a1,35 # 2b <argv>
  10:   00700893                li      a7,7
  14:   00000073                ecall

This code sits at virtual address 0x0; set a breakpoint there to observe

(gdb) b *0

The address space at this point

(qemu) info mem
vaddr            paddr            size             attr
---------------- ---------------- ---------------- -------
0000000000000000 0000000087f73000 0000000000001000 rwxu-a-
0000003fffffe000 0000000087f77000 0000000000001000 rw---a-
0000003ffffff000 0000000080007000 0000000000001000 r-x--a-

Here

0000003ffffff000 is the trampoline
0000003fffffe000 is the trapframe

Neither page has the u attribute, so the user process cannot access them

Step to the ecall instruction

0x0000000000000014 in ?? ()
=> 0x0000000000000014:  73 00 00 00     ecall

Print $stvec

(gdb) p/x $stvec
$1 = 0x3ffffff000

This is the start address of the trap handler, the trampoline; set a breakpoint there

The instructions there are exactly the contents of kernel/trampoline.S

Now print $sscratch

(gdb) p/x $sscratch
$2 = 0x3fffffe000

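This is the address of the trapframe

trampoline.S uses this address to save the user registers into the trapframe structure; the same structure also holds the kernel's $satp, stack pointer and the address of usertrap, which the trampoline loads next

In other words, the entire state of the process is sealed away

The trapframe structure is defined in kernel/proc.h

At the end, trampoline.S switches $satp to the kernel address space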

(qemu) info mem
vaddr            paddr            size             attr
---------------- ---------------- ---------------- -------
000000000c000000 000000000c000000 0000000000001000 rw---ad
000000000c001000 000000000c001000 0000000000001000 rw-----
000000000c002000 000000000c002000 0000000000001000 rw---ad
000000000c003000 000000000c003000 00000000001fe000 rw-----
000000000c201000 000000000c201000 0000000000001000 rw---ad
000000000c202000 000000000c202000 00000000001fe000 rw-----
0000000010000000 0000000010000000 0000000000001000 rw---a-
0000000010001000 0000000010001000 0000000000001000 rw---ad
0000000080000000 0000000080000000 0000000000007000 r-x--a-
0000000080007000 0000000080007000 0000000000001000 r-x----
0000000080008000 0000000080008000 0000000000003000 rw---ad
000000008000b000 000000008000b000 0000000000006000 rw-----
0000000080011000 0000000080011000 0000000000011000 rw---ad
0000000080022000 0000000080022000 0000000000001000 rw-----
0000000080023000 0000000080023000 0000000000003000 rw---ad
0000000080026000 0000000080026000 0000000007f4b000 rw-----
0000000087f71000 0000000087f71000 0000000000007000 rw---ad
0000000087f78000 0000000087f78000 0000000000088000 rw-----
0000003ffff7f000 0000000087f78000 000000000003f000 rw-----
0000003fffffd000 0000000087fb7000 0000000000001000 rw---ad
0000003ffffff000 0000000080007000 0000000000001000 r-x--a-

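Note that most of the mappings are identity mappings

The trampoline then jumps to the usertrap function in kernel/trap.c

usertrap calls the syscall function

Here the system call is exec, which runs the user program _init

user/init.c in turn opens the console and forks a process that runs _sh

The key point is that the operating system has complete control over a process's state through struct proc

The trapframe is reachable from struct proc

That is, the operating system can modify any state machine, for example to carry out a system call

It can also schedule any other state machine onto the processor

This is the context-switch mechanism behind processor virtualization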
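A minimal sketch of that idea in user-level C, not xv6 code. All names here (struct context, struct task, step, schedule) are hypothetical, and the "CPU" is just a struct holding one pc and one register. It only illustrates that a context switch is saving one state machine's state and loading another's.

```c
#include <stdint.h>

#define NTASK 2

/* Hypothetical, simplified stand-in for the saved register state. */
struct context {
  uint64_t pc; /* program counter */
  uint64_t a0; /* one general-purpose register */
};

struct task {
  struct context saved; /* state sealed away while the task is off the CPU */
};

struct task tasks[NTASK];
struct context cpu; /* the single simulated CPU */
int current = 0;    /* index of the task whose state is loaded on the CPU */

/* Run one "instruction" of whatever state is loaded on the CPU. */
void step(void) {
  cpu.a0 += 1;
  cpu.pc += 4;
}

/* Save the running task's state, then load the next task round-robin. */
void schedule(void) {
  tasks[current].saved = cpu;
  current = (current + 1) % NTASK;
  cpu = tasks[current].saved;
}
```

Calling step() a few times, then schedule(), then step() again shows that each task resumes exactly where it was sealed; the running task never notices the switch.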

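Sealing the state: architecture-specific handling

x86-64

Interrupts/exceptions come with a stack switch
The registers from before the interrupt are saved on the stack
The interrupt handler is easy to write, but the hardware implementation of the instruction is complex

xv6

The process's trap frame is placed at a fixed virtual address, which is kept in $sscratch
Once saving is done, execution switches to the kernel thread, including a stack switch
The interrupt handler is slightly more complex, but the hardware implementation is simple

TODO

Debugging kernel code and user code at the same time

In vscode this seems to need two separate configurations

Or switch explicitly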

-exec symbol-file user/_sh
-exec symbol-file kernel/kernel

After adding the symbol file in gdb

add-symbol-file user/_sh

Setting a breakpoint directly then shows

<optimized out>

Both vscode and gdb have this problem

I suspect riscv64-linux-gnu-gdb is at fault

So I tried installing riscv64-unknown-elf-gdb

$ yay -S riscv-gnu-toolchain-bin

This also needs libpython3.8.so.1.0

$ yay -S python38

Then it reports an error

dwarf2_find_location_expression: Corrupted DWARF expression.

https://github.com/riscv-collab/riscv-gnu-toolchain/issues/935

This seems to be a compiler problem; tried recompiling

Fatal error: invalid -march= option: `rv64imafdc'

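Giving up for now

Processor scheduling

An ideal world

Simplifying assumptions

A single CPU
Only two kinds of processes: compute-bound / I/O-bound
No cooperation between processes (no shared resources)

Round-Robin

The problem is that I/O-bound processes give up the CPU frequently

So Vim lags horribly

Adding a Producer and a Consumer gives the same result: they give up the CPU frequently because of the semaphore

So priorities can be set by hand

UNIX niceness

An integer in -20 .. 19; the nicer a process is, the more it lets others have the CPU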

-20 most favorable to the process
19 least favorable to the process

The good guys only get to run once the bad guys lie down

For example

taskset -c 0 nice -n 19 yes > /dev/null &
taskset -c 0 nice -n  9 yes > /dev/null &

-c pins the process to the given CPU

pkill yes
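The nice value can also be changed from inside a program. A small sketch using the POSIX calls nice() and getpriority(); the wrapper names be_nicer and current_nice, and the -100 error sentinel, are made up for this example.

```c
#include <errno.h>
#include <sys/resource.h>
#include <unistd.h>

/* Raise the caller's nice value by inc and return the new value.
   Returns -100 on error (a hypothetical sentinel outside -20 .. 19). */
int be_nicer(int inc) {
  errno = 0; /* -1 is a legal nice value, so errno tells errors apart */
  int v = nice(inc);
  if (v == -1 && errno != 0) return -100;
  return v;
}

/* Return the caller's current nice value, or -100 on error. */
int current_nice(void) {
  errno = 0;
  int v = getpriority(PRIO_PROCESS, 0); /* 0 means the calling process */
  if (v == -1 && errno != 0) return -100;
  return v;
}
```

An unprivileged process can always become nicer; lowering the nice value again normally needs extra privileges.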

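MLFQ

Setting priorities by hand is far too tedious

So introduce dynamic priorities

Keep several Round-Robin queues, one per priority level

Dynamic priority adjustment policy

Schedule the highest-priority queue first
Used up the time slice → bad guy
Gave up the CPU for I/O → good guy

However, a compute-bound process can give up the CPU on purpose and pose as a good guy

Moreover, under this policy Producer/Consumer get the highest priority and a while (1) loop starves completely

So everyone's priority has to be leveled periodically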
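A toy simulation of the policy above, under strong assumptions: every task is always runnable, an "I/O-bound" task is one that always gives up the CPU before its slice ends, and one round equals one scheduling decision. The names mlfq_task, pick, run_once, boost and simulate are hypothetical. It is a sketch of the rules, not a real scheduler.

```c
#define NLEVEL 3 /* number of priority queues; 0 is the highest */

struct mlfq_task {
  int level;    /* current queue */
  int io_bound; /* 1: yields before its time slice ends */
  int ticks;    /* how many times the task was scheduled */
};

/* Pick the task in the highest-priority queue; ties are broken
   round-robin, starting from the task after `last`. */
int pick(const struct mlfq_task *t, int n, int last) {
  int best = -1;
  for (int k = 1; k <= n; k++) {
    int i = (last + k) % n;
    if (best < 0 || t[i].level < t[best].level) best = i;
  }
  return best;
}

/* Run task i once and adjust its priority. */
void run_once(struct mlfq_task *t, int i) {
  t[i].ticks++;
  if (t[i].io_bound) {
    if (t[i].level > 0) t[i].level--;          /* yielded early: good guy */
  } else {
    if (t[i].level < NLEVEL - 1) t[i].level++; /* used the slice: bad guy */
  }
}

/* Level everyone's priority. */
void boost(struct mlfq_task *t, int n) {
  for (int i = 0; i < n; i++) t[i].level = 0;
}

/* Run `rounds` decisions; boost every boost_period rounds (0 = never). */
void simulate(struct mlfq_task *t, int n, int rounds, int boost_period) {
  int last = -1;
  for (int r = 1; r <= rounds; r++) {
    if (boost_period > 0 && r % boost_period == 0) boost(t, n);
    last = pick(t, n, last);
    run_once(t, last);
  }
}
```

With one I/O-bound and one compute-bound task and no boost, the compute-bound task runs once and then starves; with a periodic boost it gets a turn after every boost.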

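CFS

Let all processes in the system share the processor as fairly as possible

Record the precise running time of every process
After an interrupt/exception, switch to the process with the least running time

To implement priorities, give each process its own vruntime

A good guy's clock runs faster, a bad guy's clock runs slower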
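A sketch of the weighted clock, assuming each task has a weight and that vruntime advances by delta * 1024 / weight. The reference weight 1024 mirrors the value Linux uses for nice 0; everything else (cfs_task, pick_min, run_slice, simulate) is a hypothetical simplification that ignores sleeping tasks and uses a linear scan instead of a tree.

```c
#include <stdint.h>

#define NICE0_WEIGHT 1024 /* assumed reference weight */

struct cfs_task {
  uint64_t vruntime; /* weighted virtual running time */
  uint64_t weight;   /* larger weight = slower clock = more CPU */
  uint64_t ran;      /* real CPU time received */
};

/* Return the index of the task with the smallest vruntime
   (the lowest index wins ties). */
int pick_min(const struct cfs_task *t, int n) {
  int best = 0;
  for (int i = 1; i < n; i++)
    if (t[i].vruntime < t[best].vruntime) best = i;
  return best;
}

/* Account a slice of `delta` real time units to task t. */
void run_slice(struct cfs_task *t, uint64_t delta) {
  t->ran += delta;
  t->vruntime += delta * NICE0_WEIGHT / t->weight;
}

/* Always run the task whose virtual clock is furthest behind. */
void simulate(struct cfs_task *t, int n, int slices, uint64_t delta) {
  for (int s = 0; s < slices; s++)
    run_slice(&t[pick_min(t, n)], delta);
}
```

With two tasks of weight 1024 and 2048, the second task's clock runs at half speed, so it ends up with about twice the CPU time.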

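Some details and problems

A newly forked process should inherit its parent's vruntime
After returning from I/O, a process's vruntime lags far behind; while it catches up, it gets the whole CPU
vruntime can overflow as an integer; see wrapping_integers.hh from CS144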

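#include <stdbool.h>
#include <stdint.h>
// true if a comes before b, even after the counter has wrapped around
bool less(uint64_t a, uint64_t b) {
  return (int64_t)(a - b) < 0;
}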

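The real world

Priority inversion

In a real-time operating system, low-priority and high-priority tasks share resources

A low-priority task runs only once the high-priority tasks are done

Once a low-priority task is kicked off the processor while holding a mutex, the high-priority task waiting for that mutex is effectively demoted to the low priority

Multiprocessor scheduling

Migration

Migrate? Moving between processors throws away the warm cache/TLB

Don't migrate? When threads exit, the idle processors just stand by and watch

Multiple users, multiple tasks

A and B share one server

A runs a task that depends on a library and can only run single-threaded

B runs a parallel task with 1000 threads

B gets almost 100% of the CPU

Countermeasure: Linux namespaces and control groups (cgroups)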

namespaces (7), cgroups (7)

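Lightweight virtualization: creating an operating system inside the operating system

big.LITTLE / energy efficiency

The scheduler also needs to know how the CPUs differ

Non-Uniform Memory Access

Performance can differ a lot depending on whether Producer and Consumer sit on the same module or on different ones

A job given only 1/2 of the processors can actually run faster

Scheduling

Modeling
- Understand and summarize what happened in the past
- profiling and trace; PMU
Prediction
- Try to foresee what may happen in the future
Decision
- How should the system's behavior be adjusted

The operating system does not carry all the blame; let programs provide scheduling hints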

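Operating system design

OS design: a set of objects + an API for accessing them

OS implementation: a C program that implements the design above

Which objects and APIs should an operating system actually provide?

It can be large and all-encompassing (Linux/Windows API)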

The Open Group Base Specifications Issue 7 (2018 Ed.)

Windows API Index

Having an API means the systems can emulate each other

Windows Subsystem for Linux (WSL)
Linux Subsystem for Windows (Wine)

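It can provide only a minimal hardware abstraction (Microkernel)

Implement as much functionality as possible in ordinary processes

Failures are isolated at the process level

For example, as mentioned earlier, the loader code can be moved from kernel mode to user mode

Keep in the kernel only what cannot live in user mode

State machines
Cooperation between state machines: inter-process communication
Permission management

Giving each process the least privilege limits the impact of errors

Only two system calls are needed: send and receive

They are mainly used to implement RPC (remote procedure call)

Examples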

Minix
seL4 - Whitepaper

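It can have no user mode at all (Unikernel)

We already have virtual machines: hardware virtualization

Let Lab2 run the application directly

Application code is statically linked with klib, AbstractMachine and the Lab code

Any operation (including I/O) can be done directly

System calls become ordinary function calls

UNIKERNEL: from zero to getting started

Speedrunning the operating system labs

One big lock keeps everything safe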

#define atomic \
  for (int __i = (lock(), 0); __i < 1; __i++, unlock())

Use it directly

atomic {
  ...
}

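Nested uses of atomic need to be handled

Since there is only one lock, it is enough to acquire it on the first call to lock and release it on the last call to unlock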
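One possible way to get that behavior, sketched in user-level C11 as an assumption rather than the lab's actual code: keep a per-thread nesting depth and touch the real lock only when the depth goes from 0 to 1 and back. The spin lock built on atomic_flag, the depth variable and the acquisitions counter are all hypothetical; in a kernel the depth would more likely be per-CPU state.

```c
#include <stdatomic.h>

atomic_flag big_lock = ATOMIC_FLAG_INIT; /* the one big lock */
_Thread_local int depth = 0;             /* nesting depth of this thread */
int acquisitions = 0; /* real acquisitions, kept only for demonstration */

void lock(void) {
  if (depth++ == 0) { /* only the outermost lock() takes the real lock */
    while (atomic_flag_test_and_set(&big_lock)) {
      /* spin until the lock is free */
    }
    acquisitions++; /* safe: updated while holding the lock */
  }
}

void unlock(void) {
  if (--depth == 0) { /* only the outermost unlock() releases it */
    atomic_flag_clear(&big_lock);
  }
}

#define atomic \
  for (int __i = (lock(), 0); __i < 1; __i++, unlock())
```

One caveat of the for-loop trick: leaving the block with break, return or goto skips the unlock() in the loop's increment expression, so the body has to run to its end.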

software engineering

two versions

functional - model - naive correct
performance - buggy
