MIT 6.S081 2020 学习笔记

| 0 | 前言
| 1 | Lab util 相关
| 2 | Lab syscall 相关
1. | 2.1 | Chapter 2 note & LEC 3 note
2. | 2.2 | Lab: system calls
  1. | 2.2.1 | System call tracing (moderate)
  2. | 2.2.2 | Sysinfo (moderate)
| 3 | Lab pgtbl 相关
| 4 | Lab trap 相关
1. | 4.1 | Chapter 4 & LEC 6 note
2. | 4.2 | Lab: traps
| 5 | Lab lazy 相关
| 6 | Lab cow 相关
1. | 6.1 | Lab: Copy-on-Write Fork for xv6
| 7 | Lab thread 相关
| 8 | Lab lock 相关
1. | 8.1 | Lab: locks
  1. | 8.1.1 | Memory allocator (moderate)
  2. | 8.1.2 | Buffer cache (hard)
| 9 | Lab fs 相关
| 10 | Lab mmap 相关
| 11 | Lab net 相关
| 12 | Meltdown & RCU
1. | 12.1 | LEC 22 note
2. | 12.2 | LEC 23 note

| 0 | 前言

MIT 6.S081~~(6.828?6.1810?)~~ 是 MIT 一门操作系统课程，课程名是 Operating System Engineering（2020年课程网址）这是我的学习笔记，包含课程讲座(LECs), xv6 Book 以及实验(Labs)。

✅最近一次更新：2024-4-23

后补内容：

LECs:
- LEC 6: xIE, xPIE, xPP bit 部分梳理
- LEC 18: OS Organization
- LEC 20: Kernels and HLL
- LEC 21: Networking
xv6 Book:
- Chapter 5: console/uart& mtvec/timervev/CLINT 梳理
- Chapter 9: Concurrency revisited
Labs:
- Lab1 util: find & xargs 部分
- Lab8 Lock: Buffer cache 部分
- Lab11 net

💻：2018 Macbook Air

CPU: 1.6 GHz 双核Intel Core i5
8G 内存
用虚拟机跑的 Ubuntu 22.04 LTS

或许是电脑配置太低 usertests 基本都会超时三四十秒，所以延长了一点测试时间。(也可能是我代码的问题，不过暂时不优化了)。

笔记中像这样引用的句子为课程中的原文。

🔗 我的 Lab 代码链接
 🔗【知乎-肖宏辉】MIT6.S081 操作系统工程中文翻译

| 1 | Lab util 相关

| 1.1 | LEC 1 & Chapter 1 note

exec
redirect
- cat < input.txt
The 2>&1 tells the shell to give the command a file descriptor 2 that is a duplicate of descriptor 1
pipe

在 fork 之前创建 pipe 然后把 ls 的 stdout(1) 换成 pipefd[1]; grep 的 stdin(0) 换成 pipefd[0] 。
File System
- list
- a file
  
  The file’s inode and the disk space holding its content are only freed when the file’s link count is zero and no file descriptors refer to it.
- cd

| 1.2 | LEC 2 note

Use GNU debugger
- Conditional breakpoints
  
  break <location> if <condition> sets a breakpoint at the specified location, but only breaks if the condition is satisfied.
  
  cond <number> <condition> adds a condition on an existing breakpoint.
- Watchpoints
  
  watch <expression> will stop execution whenever the expression’s value changes.
  
  watch -l <address> will stop execution whenever the contents of the specified memory address change.
  
  What’s the difference between wa var and wa -l &var?
  
  rwatch [-l] <expression> will stop execution whenever the value of the expression is read.

| 1.3 | Lab: Xv6 and Unix utilities

This lab will familiarize you with xv6 and its system calls.

📌 Lab 1 代码在这里

| 1.3.1 | Boot xv6 (easy)

sudo apt-get install qemu-system-misc=1:4.2-3ubuntu6 失败

参考下面的手动编译方法编译 qemu
- 补充： sudo apt-get install libpixman-1-dev 安装 pixman devel packag

| 1.3.2 | sleep (easy)

| 1.3.3 | pingpong (easy)

为了阻止父进程在向管道写了数据后立刻把数据读出来，使得子进程不能读到数据，可以在父进程调用 write(pipefds[1], "c", 1); 后接着调用 sleep 。

| 1.3.4 | primes (moderate)/(hard)

| 1.3.5 | find (moderate) & xargs (moderate)

后补

| 2 | Lab syscall 相关

| 2.1 | Chapter 2 note & LEC 3 note

Thus an operating system must fulfill three requirements: multiplexing, isolation, and interaction.

Multiplexing 多路复用，也就是分时(time-sharing)，防止某个进程持续占有 CPU。

Isolation 隔离，为减少某些不受信任的应用对其他的影响。

Interaction 交互，为实现进程之间的信息交互，实现更复杂的功能。

| 2.1.1 | Abstracting physical resources

To achieve strong isolation it’s helpful to forbid applications from directly accessing sensitive hardware resources, and instead to abstract the resources into services.

Unix transparently switches hardware CPUs among processes, saving and restor- ing register state as necessary

Many forms of interaction among Unix processes occur via file descriptors.

| 2.1.2 | User mode, supervisor mode, and system calls

RISC-V has three modes in which the CPU can execute instructions: machine mode, supervisor mode, and user mode.

An application can execute only user-mode instructions is said to be running in user space.

The software in supervisor mode can also execute privileged instructions and is said to be running in kernel space.

The software running in kernel space (or in supervisor mode) is called the kernel.

CPUs provide a special instruction that switches the CPU from user mode to supervisor mode and enters the kernel at an entry point specified by the kernel. (RISC-V provides the ecall instruction)

| 2.1.3 | Kernel organization

monolithic kernel: entire operating system resides in the kernel

在宏内核(monolithie kernel)情况下，内核等于操作系统。

bad: no isolation within (subsystems)

微内核(microkernel)暂时不考虑。

| 2.1.4 | Process overview

The unit of isolation in Unix is a process.

隔离包括进程(processes)之间的隔离，以及进程和内核之间的隔离。

进程提供了一个看上去私有的内存空间即地址空间(address space)。

地址空间通过页表(page tables)实现。

| 2.1.5 | Code: starting xv6 and the first process

_entry -> start -> main -> userinit -> /init

| 2.2 | Lab: system calls

sev1n@ubuntu2204 ~/f/6/xv6-labs-2020> ./grade-lab-syscall                                                     syscall!?
make: 'kernel/kernel' is up to date.
== Test trace 32 grep == trace 32 grep: OK (1.4s)
== Test trace all grep == trace all grep: OK (1.1s)
== Test trace nothing == trace nothing: OK (0.9s)
== Test trace children == trace children: OK (16.6s)
== Test sysinfotest == sysinfotest: OK (2.8s)
== Test time ==
time: OK
Score: 35/35

| 2.2.1 | System call tracing (moderate)

注意 sys_trace 成功时返回值为 0

📌 这部分代码在这里 (没注意把 .gdb_history 提交了)

| 2.2.2 | Sysinfo (moderate)

xv6 的内存管理机制

通过单链表 freelist 连接所有闲置的 PAGE，每次从头(kmem->freelist)存取

📌 这部分代码在这里

| 3 | Lab pgtbl 相关

| 3.1 | Chapter 3 & LEC 4 note

| 3.1.1 | Paging hardware

PTE_U controls whether instructions in user mode are allowed to access the page;

satp 寄存器保存 l2 页表的物理地址

虚拟地址到物理地址的翻译过程：

当前进程的 satp 寄存器找到 l2 页表物理地址
通过 l2 的 9 比特(bits 和 2^9=512 个 PTE 一一对应)找到对应的 44 比特 PPN。
- l2_pte = l2_pgtbl[PX(2, va)]
通过 l2 PPN 找到 l1 页表的物理地址。
- l1_pgtbl = PTE2PA(l2_pte)
通过 l1 的 9 比特找到 l0 页表的物理地址。
- l0_pgtbl = PTE2PA(l1_pgtbl[PX(1, va)])
通过 l0 的 9 比特找到对应页的物理地址(A)。
- l0_pte = l0_pgtbl[PX(0, va)]
得到最终物理地址
- pa = PTE2PA(l0_pte) | (va & PGSHIFT)

为什么多级页表可以节省内存？

使用多级页表，程序一开始只需要三个页表(3*512 PGSIZE)，l2,l1 页表内都只有一个页表项(pte)，其他空余的位置不需要分配页表项，等需要的时候再分配。

// kernel/riscv.h
#define PGSIZE 4096 // bytes per page
#define PGSHIFT 12  // bits of offset **within** a page

#define PGROUNDUP(sz)  (((sz)+PGSIZE-1) & ~(PGSIZE-1))  // 向上取整对其一页
#define PGROUNDDOWN(a) (((a)) & ~(PGSIZE-1))  // 向下取整对其

#define PTE_V (1L << 0) // valid
#define PTE_R (1L << 1)
#define PTE_W (1L << 2)
#define PTE_X (1L << 3)
#define PTE_U (1L << 4) // 1 -> user can access
// 只用了这几个 bit

// shift a physical address to the right place for a PTE.
// 12 即 PGSHIFT，10 是因为低 10 bit 是 Flags
#define PA2PTE(pa) ((((uint64)pa) >> 12) << 10)

#define PTE2PA(pte) (((pte) >> 10) << 12)

// 0x3FF 10 bits
#define PTE_FLAGS(pte) ((pte) & 0x3FF)

// extract the three 9-bit page table indices from a virtual address.
#define PXMASK          0x1FF // 9 bits
#define PXSHIFT(level)  (PGSHIFT+(9*(level)))
#define PX(level, va) ((((uint64) (va)) >> PXSHIFT(level)) & PXMASK)

// one beyond the highest possible virtual address.
// MAXVA is actually one bit less than the max allowed by
// Sv39, to avoid having to sign-extend virtual addresses
// that have the high bit set.
// 即最高有效位
#define MAXVA (1L << (9 + 9 + 9 + 12 - 1))

typedef uint64 pte_t;
typedef uint64 *pagetable_t; // 512 PTEs
typedef uint64 *pagetable_t; // 512 PTEs 0~2^9

| 3.1.2 | Kernel address space

| 3.1.3 | Code: creating an address space

1	kinit->kvminit->kvminithart->procinit

kinit 把内核数据段结束(.data)到定义的 PHYSTOP 之间的地址分页，组织到 freelist 链表里。
kvminit
- 首先调用 kalloc 分配一页作为内核页表(kernel_pagetable)。
- 调用 kvmmap 映射上一节(3.1.2)的地址
- kvmmap 调用 mappages(kernel_pagetable, va, sz, pa, perm)
- mappages 调用 pte = walk(pagetable, a, 1) 把虚拟地址 va 映射到内核页表。
kvminithart 设置 satp 寄存器并清空 TLB 缓存。
procinit
- 调用 kvmmap 给 KSTACK 宏计算出来的内核栈虚拟地址映射物理地址。
- 调用 kvminithart

注意：

kalloc 分配的一页可以作为一个页表。因为 PAGESIZE 是 4096B 也就是 8*512B，页表项只用了 54 bit 不到 8B，所以是够的。
procinit 最后还要再调用一次 kvminithart 因为”当内核修改页表时，它需要通知硬件关于这些变化，以便处理器能够使用最新的页表进行地址转换。” (by chatGPT)。存疑？

| 3.1.4 | Code: exec

把存在文件系统的一个文件加载到内存同时初始化用户地址空间(address space)

在检查完文件格式后 exec 会分配一个新的页表，给每个 ELF 段(segment) 分配并加载到内存。

Exec allocates a new page table with no user mappings with proc_pagetable, allocates memory for each ELF segment with uvmalloc, and loads each segment into memory with loadseg. loadseg uses walkaddr to find the physical address of the allocated memory at which to write each page of the ELF segment, and readi to read from the file.

给用户栈(ustack)分配一页然后初始化。

Exec copies the argument strings to the top of the stack one at a time, recording the pointers to them in ustack. It places a null pointer at the end of what will be the argv list passed to main. The first three entries in ustack are the fake return program counter, argc and argv pointer.

分配一个 guard page 在 ustack 的上方。

| 3.2 | LEC 5 note

reg    | name  | saver  | description
-------+-------+--------+------------
x0     | zero  |        | hardwired zero
x1     | ra    | caller | return address
x2     | sp    | callee | stack pointer
x3     | gp    |        | global pointer
x4     | tp    |        | thread pointer
x5-7   | t0-2  | caller | temporary registers
x8     | s0/fp | callee | saved register / frame pointer
x9     | s1    | callee | saved register
x10-11 | a0-1  | caller | function arguments / return values
x12-17 | a2-7  | caller | function arguments
x18-27 | s2-11 | callee | saved registers
x28-31 | t3-6  | caller | temporary registers
pc     |       |        | program counter

| 3.3 | Lab: page tables

| 3.3.1 | Print a page table (easy)

📌 这部分代码在这里

| 3.3.2 | A kernel page table per process (hard)

为什么每个进程都需要一个内核页表？

用户态页表使得用户态可以直接访问虚拟地址(由硬件来翻译)
但是当陷入内核态时，内核的页表没有用户态的虚拟地址，所以不能直接访问用户虚拟地址
为了访问用户虚拟地址，xv6 通过 walk 函数把来自用户态的虚拟地址转换成内核可操作的物理地址
当每个进程有自己的内核页表时的时候，把 satp 设置成内核页表的地址就可以直接访问用户态虚拟地址了

所以 proc 结构体在初始化时，多维护一个 kpagetable，首先映射(map)内核空间，然后当用户页表更新时，同时更新内核页表；在切换进程时也需要切换全局内核页表(调度器内核页表)。

需要注意的几点：

每个进程在创建时都需要设置好内核页表，包含一下几点，分配其空间(一页)，映射内核地址空间。
- 进程的创建在 allocproc()，它被 userinit() 和 fork() 调用。
每个进程都有内核栈(kstack)，在之前的版本，所有的内核栈都被 procinit 映射到全局的内核页表。现在映射内核栈需要在分配好内核页表之后。
这一节只需要保持 kpagetable 和全局变量kernel_pagetable 一致，所以还要再把内核栈映射到全局 kernel_pagetable，不用同步更新内核页表。

释放内核页表(参考 proc_freepagetable)

清理叶子结点。
调用 freewalk 释放占用的物理页框。

📌 这部分代码在这里

| 3.3.3 | Simplify `copyin/copyinstr` (hard)

为了防止用户地址空间和内核地址空间重叠所以需要限制用户虚拟地址上限是 0xC000000。为了内核页表能访问用户态虚拟地址，需要保持内核页表和用户页表在用户地址空间一致。之前的 kernel_pagetable 映射了 CLINT(0x2000000)，如果限制 MAXUVA 是 CLINT 的话通过不了 usertests sbrkmuch 所以可以不用映射 CLINT 。

用户页表的创建/更新过程如下：

proc_pagetable()
- 调用 uvmcreate() 获得一个物理页框。
- 然后 mappages() 映射 trampoline 以及 trapframe。
fork()
- 先调用 allocproc() 获得一个基本的用户页表。
- 然后 uvmcopy() 复制父进程的页表即可。
userint()
- 同样是 allocproc() 起手。
- 然后调用 uvminit 把 initcode 映射到 va=0 的位置。
exec() 见 | 3.1.4 |
sbrk()
- growproc() 根据传入参数正负调用uvmalloc/uvmdealloc。
- uvmalloc() 从原来的虚拟地址往上分配页。
- uvmdealloc() 从原来的虚拟地址往下减少页。

内核页表应该只建立相同映射，不再分配和释放物理帧。

用户页表释放过程:

freeproc() 在 proc_freepagetable() 的时候也要对内核页表同样操作。
exec() 会释放原来的用户页表。
因为有 kstack, kpagetable 应随着进程的创建释放而创建释放。

注意：

限制用户地址空间不超过：
- p->sz 是当前进程占用内存大小，限制它即可。
- 写 p->sz 的地方只有两处是用变量赋值的分别是。
  1
  2
  exec.c:114:3: p->sz = sz;
  proc.c:264:3: p->sz = sz; // growproc
  这两个 sz 都是 uvmalloc() 的返回值，所以在这里限制就可以。
当 exec goto bad 的时候，kpagetable 必须和刚开始一样。

~~梳理完 user pagetable 机制之后终于是清晰了很多，但是 debug 还是痛苦且漫长。做出来之后还是很有成就感(笑)~~

📌 这部分代码在这里

| 4 | Lab trap 相关

| 4.1 | Chapter 4 & LEC 6 note

There are three kinds of event which cause the CPU to set aside ordinary execution of instructions and force a transfer of control to special code that handles the event.

system call(ecall)
exception
device interrupt

~~不会翻译 trap 就直接写英文了。~~

tarp 应该是 tansparent，尤其对于中断，因其随时可能发生。

While commonality among the three trap types suggests that a kernel could handle all traps with a single code path, it turns out to be convenient to have separate assembly vectors and C trap handlers for three distinct cases: traps from user space, traps from kernel space, and timer interrupts.

| 4.1.1 | RISC-V trap machinery

What does the CPU’s “mode” protect?
i.e. what does switching mode from user to supervisor allow?
supervisor can use CPU control registers:

satp – page table physical address

stvec – ecall jumps here in kernel; points to trampoline

sepc – ecall saves user pc here

sscratch – address of trapframe

scause – a number put by RISC-V that describes the reason for the trap

sstatus: supervisor status register

…..

supervisor can use PTEs that have no PTE_U flag
but supervisor has no other powers! e.g.

can’t use addresses that aren’t the in page table（注意：这是当 satp 的 mode bit 不为 0 的情况下，所以如果一个在 S-mode 下设置 satp=0 则可以直接访问物理地址）
so kernel has to carefully set things up so it can work

如何识别 RISC-V CPU 处于什么模式/xIE, xPIE, xPP bit 是怎么作用的(x = M/S/U)

后补

| 4.1.2 | Traps from user space

preview:
  write()                        write() returns              User
  -----------------------------------------------------------------
  uservec() in trampoline.S      userret() in trampoline.S   Kernel
  usertrap() in trap.c           usertrapret() in trap.c
  syscall() in syscall.c           ^
  sys_write() in sysfile.c      ---|

ecall did three things:

change mode from user to supervisor

save pc in sepc

jump to stvec(the kernel previously set stvec, before jumping to user space)

uservec():
- 保存 32 个用户寄存器(用于恢复)
- 切换到内核页表, 内核栈
- 跳(jump)到 usertrap()
usertrap()
- 修改 stvec = kernelvec
- 根据 scause 调用不同的处理函数(handler)
- usertrapret()
usertrapret()
- 设置 stvec = uservec
- 在 trapframe 中保存以下未在 tampline.S 保存的值, 也是初始化新的 trapframe 的地方:
  - kpagetable 所以 uservec 可以切换
  - kstack 所以 uservec 可以切换
  - hartid, 可能被其他 CPU 调度要更新
  - usertrap 地址，让 uservec 找得到
- 设置 sstatus 寄存器
- 设置 sepc = trapframe->epc
userret()
- 切换 satp = 用户页表
- 恢复 32 个用户寄存器
- sret
sret
- set sstatus “previous mode” bit to user
- 恢复 pc = sepc

| 4.1.3 | Traps from kernel space

kernelvec

保存 31 个寄存器到内核栈

调用 kerneltrap

恢复寄存器

sret

| 4.2 | Lab: traps

| 4.2.1 | RISC-V assembly (easy)

a0~a7, a2 holds 13 when calling printf in main.
f and g is called in compiler.
printf is located at 0x628
ra is 0x38
out put is “HE110 Worid”, if big-endian i = 0x726c6400. 57616 need not to be changed.
after “y=”, it will print (int)$a2, because the 2nd arguement of format strings is a2.

| 4.2.2 | Backtrace (moderate)

~~提交记录有点乱，就直接贴 diff 了~~

diff --git a/kernel/defs.h b/kernel/defs.h
index 4b9bbc0..137c786 100644
--- a/kernel/defs.h
+++ b/kernel/defs.h
@@ -80,6 +80,7 @@ int             pipewrite(struct pipe*, uint64, int);
 void            printf(char*, ...);
 void            panic(char*) __attribute__((noreturn));
 void            printfinit(void);
+void            backtrace(void);

 // proc.c
 int             cpuid(void);
diff --git a/kernel/printf.c b/kernel/printf.c
index e1347de..e9e1ccc 100644
--- a/kernel/printf.c
+++ b/kernel/printf.c
@@ -132,3 +132,15 @@ printfinit(void)
   initlock(&pr.lock, "pr");
   pr.locking = 1;
 }
+
+void backtrace() {
+  printf("backtrace:\n");
+  uint64 currfp = r_fp();
+  uint64 stack_top = PGROUNDUP(currfp);
+  while(currfp <= stack_top) {
+    printf("%p\n", *(uint64*)(currfp-8));
+    currfp = *(uint64*)(currfp-0x10);
+    if(*(uint64*)(currfp-8) <= PLIC)
+      break;
+  }
+}
diff --git a/kernel/riscv.h b/kernel/riscv.h
index 0aec003..7a443e9 100644
--- a/kernel/riscv.h
+++ b/kernel/riscv.h
@@ -311,6 +311,14 @@ r_ra()
   return x;
 }

+static inline uint64
+r_fp()
+{
+  uint64 x;
+  asm volatile("mv %0, s0" : "=r" (x) );
+  return x;
+}
+
 // flush the TLB.
 static inline void
 sfence_vma()
diff --git a/kernel/sysproc.c b/kernel/sysproc.c
index e8bcda9..fd19d2a 100644
--- a/kernel/sysproc.c
+++ b/kernel/sysproc.c
@@ -70,6 +70,7 @@ sys_sleep(void)
     sleep(&ticks, &tickslock);
   }
   release(&tickslock);
+  backtrace();
   return 0;
 }

| 4.2.3 | Alarm (hard)

📌 这部分代码在这里

| 5 | Lab lazy 相关

| 5.1 | Chapter 4.6 & LEC 8 note

Key idea: change page tables on page fault, update page table instead of panic and restart instruction.

Information we might need at page fault to do something interesting:

The virtual address that caused the fault
See stval register; page faults set it to the fault address

The type of violation that caused the fault
See scause register value (instruction, load, and Store page fault)

The instruction and mode where the fault occurred

User IP: tf->epc

U/K mode: implicit in usertrap/kerneltrap

| 5.1.1 | Lazy Allocation

sbrk allocates memory that may never be used.

plan: allocate physical memory when application needs it

Adjust p->sz on sbrk, but don’t allocate.

When application uses that memory, it will result in page fault.

On pagefault allocate memory resume at the fault instruction.

除此以外，在 unmap 的时候，对延迟分配(lazy allocation) 还未分配的内存应该不处理，继续循环。

| 5.1.2 | Copy-On-Write (COW) fork

| 6.1 |

| 5.1.3 | 其他

Zero Fill Page

bss 段(segment) 最开始都是 0，所以可把 bss 虚拟地址先映射到同一页且只读。

当向 bss 写时触发缺页错误(page fault)，然后再该映射虚拟地址对应页。
Demand Paging

idea: load pages from the file on demand.

allocate page table entries, but mark them on-demand.

on fault, read the page in from the file and update page table entry.

need to keep some meta information(Virtual Memoroy Area) about where a page is located on disk.

Paging From Disk

idea: store less-frequently used parts of the address space on disk. page-in and page-out pages of the address address space transparently.

Memory-Mapped Files

放在 mmap lab 讨论

| 5.2 | Chapter 5 & LEC 9 note

interrupts

the interrupt tells the kernel the device hardware wants attention.
new issues:
- asynchronous: between running process and interrupt handler
- concurrency: device and process run in parallel
- programming devices: device can be difficult to program
中断的来源

device -> PLIC(Platform Level Interupt Control) -> CPU

内核通过对 PLIC 编程灵活处理中断。
驱动(driver) 负责管理特定的设备：
- configures the device hardware.
- tells the device to perform operations.
- handles the resulting interrupts.
- interacts with processes that may be waiting for I/O from the device.
Many device drivers execute code in two contexts:
- a top half that runs in a process’s kernel thread called via system calls.
- a bottom half that executes at interrupt time as interrupt handler handling device interupt.

| 5.2.1 | Code: Console input/output

当 xv6 启动时，$ 是如何被打印到屏幕？

当用户在键盘敲下 ls\n 时，屏幕是如何显示 ls 以及 ls 执行的内容的？

涉及 console/uart 的梳理，后补

| 5.2.2 | Timer interrupts

涉及 mtvec/timervev/CLINT 后补

| 5.3 | Lab: xv6 lazy page allocation

| 5.3.1 | Lazy allocation (moderate) && Lazytests and Usertests (moderate)

sbrk() 不分配物理内存，只修改 p->sz
什么时候分配：
- 用户访问时触发用户触发缺页异常时(usertrap->page fault)，再分配
- 用户读写 sbrk() 返回的地址时，由于 xv6 是软件层面 pa = walkaddr(va)，所以此时也需要分配。
注意：
- va = p->sz 的时候是非法的。
- 如果 sbrk(n) < 0, 需要 sbrk() 释放(unmap) 参考 uvmdealloc()。否则 p->sz 减小后，上面的虚拟地址没有 unmap 且已经丢失。

📌 这部分代码在这里

| 6 | Lab cow 相关

The goal of copy-on-write (COW) fork() is to defer allocating and copying physical memory pages for the child until the copies are actually needed, if ever.

uvmcopy()
- walk() 父进程的页表， *pte &= ~PTE_W,*pte |= PTE_C
- 映射到子进程的页表, 引用次数增加(reference count ++);
usertrap()
- 在缺页异常的时候，识别出 COW 页。
- 分配物理内存, 分配失败则杀死(kill)。
- 复制原数据
- 修改页表项(PTE) 指向新的物理帧且可写，原页的 reference count--
- 重新执行
proc_freepagetable()
- 如果是 COW 页，--reference count == 0 则释放
reference count 如何实现
- NPROC = 64，小于 0xff
- 计算 freerange() 创建了多少页，然后根据该数量减去内存(见下面的 kalloc.c-kinit())。
- kvmmap() 的时候不应该设置 ref count，因为其是表示当前页被用户页表引用的数量。
- kalloc() 时设置 cnt = 0
- mappages() 的时候，cnt++
- uvmunmap() 且 do_free 时 cnt--
- cnt==1 则修改 pte 即可。
copyout()
- 和 usertap() 类似操作

| 6.1 | Lab: Copy-on-Write Fork for xv6

diff --git a/kernel/defs.h b/kernel/defs.h
index 4b9bbc0..234b90d 100644
--- a/kernel/defs.h
+++ b/kernel/defs.h
@@ -63,6 +63,9 @@ void            ramdiskrw(struct buf*);
 void*           kalloc(void);
 void            kfree(void *);
 void            kinit(void);
+void            inc_ref_cnt(uint64);
+uint8           get_ref_cnt(uint64);
+uint8           dec_ref_cnt(uint64);

 // log.c
 void            initlog(int, struct superblock*);
@@ -167,11 +170,13 @@ int             uvmcopy(pagetable_t, pagetable_t, uint64);
 void            uvmfree(pagetable_t, uint64);
 void            uvmunmap(pagetable_t, uint64, uint64, int);
 void            uvmclear(pagetable_t, uint64);
+pte_t*          walkpte(pagetable_t pagetable, uint64 va);
 uint64          walkaddr(pagetable_t, uint64);
 int             copyout(pagetable_t, uint64, char *, uint64);
 int             copyin(pagetable_t, char *, uint64, uint64);
 int             copyinstr(pagetable_t, char *, uint64, uint64);

+
 // plic.c
 void            plicinit(void);
 void            plicinithart(void);
diff --git a/kernel/kalloc.c b/kernel/kalloc.c
index fa6a0ac..8234080 100644
--- a/kernel/kalloc.c
+++ b/kernel/kalloc.c
@@ -14,6 +14,8 @@ void freerange(void *pa_start, void *pa_end);
 extern char end[]; // first address after kernel.
                    // defined by kernel.ld.

+uint8* phy_ref_cnt;
+
 struct run {
   struct run *next;
 };
@@ -23,11 +25,42 @@ struct {
   struct run *freelist;
 } kmem;

+void inc_ref_cnt(uint64 pa) {
+  acquire(&kmem.lock);
+  if(pa >= (uint64)end && pa < PHYSTOP)
+    phy_ref_cnt[(pa-(uint64)end)>>12]++;
+  release(&kmem.lock);
+}
+
+uint8 dec_ref_cnt(uint64 pa) {
+  uint8 result=0;
+  if(get_ref_cnt(pa) == 0)
+    panic("dec_ref_cnt");
+  acquire(&kmem.lock);
+  result = --phy_ref_cnt[(pa-(uint64)end)>>12]; // called by uvmunmap pa is in range
+  release(&kmem.lock);
+  return result;
+}
+
+uint8 get_ref_cnt(uint64 pa) {
+  uint8 result = 1;
+  acquire(&kmem.lock);
+  if(pa >= (uint64)end && pa < PHYSTOP)
+    result = phy_ref_cnt[(pa-(uint64)end)>>12];
+  release(&kmem.lock);
+  return result;
+}
+
 void
 kinit()
 {
   initlock(&kmem.lock, "kmem");
-  freerange(end, (void*)PHYSTOP);
+
+  // caculate the need of ref counts
+  int bytes = (PHYSTOP-PGROUNDUP((uint64)end))>>12;
+  phy_ref_cnt = (uint8*)end;
+
+  freerange((void*)end+bytes, (void*)PHYSTOP);
 }

 void
@@ -72,8 +105,10 @@ kalloc(void)

   acquire(&kmem.lock);
   r = kmem.freelist;
-  if(r)
+  if(r) {
     kmem.freelist = r->next;
+    phy_ref_cnt[((uint64)r-(uint64)end)>>12] = 0;
+  }
   release(&kmem.lock);

   if(r)
diff --git a/kernel/riscv.h b/kernel/riscv.h
index 0aec003..7b5db8a 100644
--- a/kernel/riscv.h
+++ b/kernel/riscv.h
@@ -331,6 +331,7 @@ sfence_vma()
 #define PTE_W (1L << 2)
 #define PTE_X (1L << 3)
 #define PTE_U (1L << 4) // 1 -> user can access
+#define PTE_C (1L << 8)

 // shift a physical address to the right place for a PTE.
 #define PA2PTE(pa) ((((uint64)pa) >> 12) << 10)
diff --git a/kernel/trap.c b/kernel/trap.c
index a63249e..e08ee89 100644
--- a/kernel/trap.c
+++ b/kernel/trap.c
@@ -10,6 +10,7 @@ struct spinlock tickslock;
 uint ticks;

 extern char trampoline[], uservec[], userret[];
+extern uint8* phy_ref_cnt;

 // in kernelvec.S, calls kerneltrap().
 void kernelvec();
@@ -65,6 +66,34 @@ usertrap(void)
     intr_on();

     syscall();
+  } else if(r_scause() == 15) {
+    // write page fault
+    uint64 val = r_stval();
+    if(val >= p->sz) {// check va
+      p->killed = 1;
+      exit(-1);
+    }
+    pte_t *pte = walkpte(p->pagetable, val);
+    if(pte == 0 || !(*pte & PTE_C)) {
+      p->killed = 1;
+      exit(-1);
+    }
+
+    uint64 pa = PTE2PA(*pte);
+    if (get_ref_cnt(pa) == 1) {
+      *pte |= PTE_W;
+    }
+    else {
+      char* mem = (char*)kalloc();
+      if(mem == 0) {
+        p->killed = 1;
+        exit(-1);
+      }
+      memmove(mem, (char*)pa, PGSIZE);
+      uvmunmap(p->pagetable, PGROUNDDOWN(val), 1, 1);
+      mappages(p->pagetable, PGROUNDDOWN(val), PGSIZE, (uint64)mem, PTE_X|PTE_W|PTE_U|PTE_R);
+    }
+
   } else if((which_dev = devintr()) != 0){
     // ok
   } else {
diff --git a/kernel/vm.c b/kernel/vm.c
index bccb405..85c7424 100644
--- a/kernel/vm.c
+++ b/kernel/vm.c
@@ -111,14 +111,46 @@ walkaddr(pagetable_t pagetable, uint64 va)
   return pa;
 }

+pte_t* walkpte(pagetable_t pagetable, uint64 va) {
+  pte_t *pte;
+
+  if(va >= MAXVA)
+    return 0;
+
+  pte = walk(pagetable, va, 0);
+  if(pte == 0)
+    return 0;
+  if((*pte & PTE_V) == 0)
+    return 0;
+  if((*pte & PTE_U) == 0)
+    return 0;
+  return pte;
+}
+
 // add a mapping to the kernel page table.
 // only used when booting.
 // does not flush TLB or enable paging.
 void
 kvmmap(uint64 va, uint64 pa, uint64 sz, int perm)
 {
-  if(mappages(kernel_pagetable, va, sz, pa, perm) != 0)
-    panic("kvmmap");
+  // kvmmap should not increase ref cnt
+  uint64 a, last;
+  pte_t *pte;
+
+  a = PGROUNDDOWN(va);
+  last = PGROUNDDOWN(va + sz - 1);
+  for(;;){
+    if((pte = walk(kernel_pagetable, a, 1)) == 0)
+      panic("kvmmap");
+    if(*pte & PTE_V)
+      panic("remap");
+    *pte = PA2PTE(pa) | perm | PTE_V;
+    if(a == last)
+      break;
+    a += PGSIZE;
+    pa += PGSIZE;
+  }
+
 }

 // translate a kernel virtual address to
@@ -159,6 +191,7 @@ mappages(pagetable_t pagetable, uint64 va, uint64 size, uint64 pa, int perm)
     if(*pte & PTE_V)
       panic("remap");
     *pte = PA2PTE(pa) | perm | PTE_V;
+    inc_ref_cnt(pa);
     if(a == last)
       break;
     a += PGSIZE;
@@ -188,7 +221,8 @@ uvmunmap(pagetable_t pagetable, uint64 va, uint64 npages, int do_free)
       panic("uvmunmap: not a leaf");
     if(do_free){
       uint64 pa = PTE2PA(*pte);
-      kfree((void*)pa);
+      if(!dec_ref_cnt(pa))
+        kfree((void*)pa);
     }
     *pte = 0;
   }
@@ -311,7 +345,6 @@ uvmcopy(pagetable_t old, pagetable_t new, uint64 sz)
   pte_t *pte;
   uint64 pa, i;
   uint flags;
-  char *mem;

   for(i = 0; i < sz; i += PGSIZE){
     if((pte = walk(old, i, 0)) == 0)
@@ -319,14 +352,11 @@ uvmcopy(pagetable_t old, pagetable_t new, uint64 sz)
     if((*pte & PTE_V) == 0)
       panic("uvmcopy: page not present");
     pa = PTE2PA(*pte);
+    *pte = *pte | PTE_C;
+    *pte = *pte & ~PTE_W;
     flags = PTE_FLAGS(*pte);
-    if((mem = kalloc()) == 0)
-      goto err;
-    memmove(mem, (char*)pa, PGSIZE);
-    if(mappages(new, i, PGSIZE, (uint64)mem, flags) != 0){
-      kfree(mem);
+    if(mappages(new, i, PGSIZE, (uint64)pa, flags) != 0)
       goto err;
-    }
   }
   return 0;

@@ -355,15 +385,35 @@ int
 copyout(pagetable_t pagetable, uint64 dstva, char *src, uint64 len)
 {
   uint64 n, va0, pa0;
+  pte_t *pte;

   while(len > 0){
     va0 = PGROUNDDOWN(dstva);
-    pa0 = walkaddr(pagetable, va0);
-    if(pa0 == 0)
+    pte = walkpte(pagetable, va0);
+    if(pte == 0)
       return -1;
+    pa0 = PTE2PA(*pte);
+
+    if((*pte & PTE_C) && !(*pte&PTE_W)) {
+      // dealing with COW page
+      if (get_ref_cnt(pa0) == 1) {
+        *pte |= PTE_W;
+      }
+      else {
+        char* mem = (char*)kalloc();
+        if(mem == 0) {
+          return -1;
+        }
+        memmove(mem, (char*)pa0, PGSIZE);
+        uvmunmap(pagetable, va0, 1, 1);
+        mappages(pagetable, va0, PGSIZE, (uint64)mem, PTE_X|PTE_W|PTE_U|PTE_R);
+        pa0 = (uint64)mem;
+      }
+    }
+
     n = PGSIZE - (dstva - va0);
     if(n > len)
-      n = len;
+      n = len;  // dealing one page a time
     memmove((void *)(pa0 + (dstva - va0)), src, n);

     len -= n;

| 7 | Lab thread 相关

| 7.1 | Chapter 6 & LEC 10 note

concurrency problems - race conditions
broken acquire - 使用硬件提供的原子操作(test&set …)
lock problems - performance/deadlock
mem ordering
sleep lock
- 为什么 sleep lock 不用关闭中断

| 7.2 | Chapter 7 & LEC 11 note & LEC 13 note

为了允许多于 CPU 数量的进程“同时”运行，操作系统需要提供多路复用(Mutiplexing) 包括 CPU 在内的共享资源。

多路复用面临的挑战：

how to switch from one process to another? idea is sample but implementation is opaque.

how to force switches in a way that is transparent to user processes? - timer interrupts

a locking plan is necessary to avoid races, for CPUs may be switching among processes concurrently.

cannot free a process’s memory and other resources when the process exits because (for example) it can’t free its own kernel stack while still using it.

each core of a multi-core machine must remember which process it is executing so that system calls affect the correct process’s kernel state. - myproc()

races(the loss of wakeup) in sleep and wakeup.

| 7.2.1 | switch & scheduling

overview of thread switching in xv6
  (the point: switch among threads to interleave many threads on each CPU)
  [diagram: P1, TF1, STACK1, swtch(), CTX1;
            CTXs, swtch(), STACKs, scheduler(), &c]
  TF = trapframe = saved user registers
  CTX = context = saved RISC-V registers
  getting from one process to another involves multiple transitions:
    - user -> kernel; saves user registers in **trapframe**
    - kernel thread -> scheduler thread; saves kernel thread registers in **context**
    - scheduler thread -> kernel thread; restores kernel thread registers from **context**
    - kernel -> user; restores user registers from **trapframe**

swtch()

只保存被调用者保存(callee saved)寄存器。

swtch() 返回到 new->context->ra 指向的地址，该地址是这个新线程之前调用 swtch() 的返回地址(sched()/forkret()/scheduler())。
进程在准备让出 CPU 之前必须满足几个要求：
- 获得自己的 p->lock，以保护 p->state, cpu->proc。如果没有 p->lock 在当前 CPU 执行 yeild() 和 scheduler():swtch() 之间，其他的 CPU 很可能调度到此进程，此时两个 CPU 共用同一个内核栈。
- 修改进程的状态(p->state)，修改成 SLEPPING/RUNABLE/ZOMBIE。
- 这样的函数有：sleep(), yield(), exit() 。
scheduler()
1. 打开中断，遍历 proc[] 找到一个 RUNNALBE 进程(p)(首先会申请当前进程的锁 p->lock)。
2. 更新 p->state 为 RUNNING, c->proc 为当前进程。
3. 调用swtch() 到进程 p。
4. 当被 sleep()/yield()/exit() 调用 sched() 切换回来时，清空 c->proc, 释放 p->lock。
5. 如果当前只有 init 和 sh 两个进程，则打开中断并执行指令 wfi 等待中断，否则继续执行。
6. 重复 1. 。
mycpu()
- mycpu() 要求调用者关闭中断，直到使用完 mycpu() 返回的值为止。
- 因为如果在这期间被中断，hartid 会被修改，导致索引有误。
myproc()
- 关中断。
- 保存 mycpu()->proc。
- 关中断。在这之后被中断是允许的，因为 proc 结构体不依赖 hartid, 我们只需要根据 hartid 找到当前 CPU 正在执行的进程
- 返回结构体地址。
sched() 持有 p->lock，切换到 scheduler() 后，会释放 p->lock，这两个锁一样吗？– 完全一样
- 因为一个进程 p1 首先是通过 scheduler() 获取了锁，然后再通过 swtch() 返回后释放。
- 而 p1 调用 sched() 回到的 scheduler() 和当初调度他的 scheduler() 是同一个 CPU, 所以此时 scheduler() 上下文的 p 即 p1。
- 此外对于 fork() 的子进程，在 scheduler() 开头遍历时获取 p->lock，在 forkret() 释放。

| 7.2.2 | sleep & wakeup

one process to sleep when waiting for an event and another process to wake it up once the event has happened.

比如以下事件(event)

waiting pipes
disk r/w
sys_wait()

loss wakeups (broken sleep: wakeup before sleep).

loss wakeups 的解决方法:

调用 sleep() 之前必须保证持有调用 wakeup() 也需要的 conditional lock
xv6 通过修改 sleep() 接口，sleep() 接收除了 channel 以外的另一个参数，conditional lock

sleep()

申请 p->lock, 释放 conditional lock。
更新 p->chan = chan, p->state = SLEEPING。
调用 sched() 让出 CPU。
当 p->state 被 wakeup()/kill()/exit() 修改后，会被 sheduler() 调度，从 sched() 返回。
清空 p->chan。
重新申请 conditional lock 后返回。

wakeup()

遍历 procp[] 匹配 p->state == SLEEPING && p->chan == chan, 在这之前会获取 p->lock。
更新 p->state = RUNNABLE。
释放 p->lock 返回。

注意：

通常在调用 wakeup() 之前应该申请相应的 conditional lock， wakeup() 返回后释放。
多个进程使用同一个 sleeping channel 是允许的，因为在 sleep() 返回前会重新申请 conditional lock, 尽管这些进程都会被 wakeup() 但是只有一个能成功申请到锁。

| 7.2.3 | wait, exit & kill

wait()

找到一个 ZOMBIE 子进程。
释放该进程 (freeproc())。
返回子进程 pid。
如果没有没有子进程，或被 kill,则直接返回。
调用 sleep() 等待子进程 exit() 时调用 wakeup1()。

注意：

wait often holds two locks, xv6 must obey the same locking order (parent, then child) in order to avoid deadlock.

wait uses np->parent without holding np->lock. It is possible that np is an ancestor of the current process, in which case acquiring np->lock could cause a deadlock. A process’s parent field is only changed by its parent, so if np->parent==p is true, the value can’t change unless the current process changes it.

在判断 np->parent==p 后才会申请 np->lock

exit()

exit allows a process to terminate itself

关闭了所有已打开的文件
调用 wakeup() 唤醒 initproc，之后 initproc 继续执行 wait()。
申请 p->parent->lock 因为要唤醒父进程。
再申请 p->lock 因为要更新 p->state 且重新设置父进程为(reparent to) initproc。注意由于上面的顺序(memory ordering)先父进程再子进程。
reparent 。
唤醒父进程。
修改 p->state=ZOMBIE。
调用 sched() 让出 CPU。

注意：

在修改 p->state=ZOMBIE 之前唤醒父进程是可以的，因为当父进程从 wait() 的 sleep() 返回时，最终会申请 np->lock。直到 exit() 调用 sched() 之后，scheduler() 才会释放该子进程的锁。

kill()

kill lets one process request that another terminate.

It would be too complex for kill to directly destroy the victim process, since the victim might be executing on another CPU, perhaps in the middle of a sensitive sequence of updates to kernel data structures.
Thus kill does very little:

sets the victim’s p->killed

if it is sleeping, wakes it up.(p->state = RUNNABLE)

注意：

kill 直接唤醒当前进程很可能造成目标进程从 sleep() 中返回。此时很可能当前程序希望 sleep() 等待的条件还没有达成，因而造成程序错误。
所以一些封装 sleep() 的循环，当 sleep() 返回时，需要检查 p->kill
对于一些要求原子操作的循环，比如读写磁盘，则相反必须等原子操作完成后再检查 p->kill

| 7.3 | Lab: Multithreading

| 7.3.1 | Uthread: switching between threads (moderate)

需要注意的是：

初始化 sp=t->stack + STACK_SIZE 即 sp 是递减的。
thread_switch() 之后不用更新 current_thread = t; 因为每次都会从头调用调度器，已经设置了 current_thread。

📌 这部分代码在这里

| 7.3.2 | Using threads (moderate)

📌 这部分代码在这里

| 7.3.3 | Barrier(moderate)

只需要遵守调用 wakeup() 的通用方法| 7.2.2 | sleep & wakeup

📌 这部分代码在这里

| 8 | Lab lock 相关

| 8.1 | Lab: locks

| 8.1.1 | Memory allocator (moderate)

不用对所有 CPU 等分 freelist，因为不确定 CPU 数量不确定，只需要给 CPU0 分配所有内存，其他的 CPU 调用 steal() 即可
steal() 不能传入调用他时的 cpuid 并且不遍历这个 CPU。因为很可能 steal() 执行过程中被调度，更新了 freelist()，不遍历会导致丢失部分内存。

📌 这部分代码在这里

| 8.1.2 | Buffer cache (hard)

用哈希表代替原来 buffer cache 的一个双链表，来避免 lock contention

后补

自我感觉这部分(并发，死锁，条件竞争)很难调试，内核出错的位置和代码错误的位置有一定距离，很难通过 gdb 找到。

LEC 15，Frans 教授提到可以通过 “write down assertion for invariants.”

后面再做吧，先把时间留给文件系统和 mmap。

| 9 | Lab fs 相关

| 9.1 | LEC 14 note

|9.1.1| Buffer cache layer

bread()

bwrite()

brelse()

balloc()

bfree()

|9.1.2| inode layer(icache)

ialloc()

iget()

ilock()

iput()

iupdate()

bmap(): 返回 inode.data.logic_block_number 对应的实际块号(block number)。

itrunc(): 释放 inode 的数据占用的所有块。

readi()

writei()

| 9.2 | LEC 15 note & Chapter 8

|9.2.1| logging layer

日志(log) 机制原理：

write-ahead log rule:

install none of a transaction’s writes to disk

until all writes are in the log on disk,

and the logged writes are marked committed.

组提交(group commit)：

the logging system only commits when no file-system system calls are underway.

commit order can’t be changed

固定日志大小的问题：

一次 write() 的大小不能超过日志定义的块的数量。

unlinking a large file might write many bitmap blocks and an inode.

begin_op()

log_write()

end_op()

commit()

write_log(): 把事务(transaction)涉及的块(保存在内存 log 结构体的 log.lh.block[]) 写到磁盘的存放日志块的区域(block 2-31)。

write_head(): 提交点(commit point), 把内存中的日志头写到磁盘的日志头。

install_trans(): 根据日志头把磁盘中的日志写到对应的位置(home location)。

recover_from_log()

|9.2.2| Directory, pathname, file descriptor layer

dirlookup()

dirlink()

namex(): 根据 path 返回对应 icache 地址。

skipelem(): 返回子路径，设置 name。

// Examples:
//   skipelem("a/bb/c", name) = "bb/c", setting name = "a"
//   skipelem("///a//bb", name) = "bb", setting name = "a"
//   skipelem("a", name) = "", setting name = "a"
//   skipelem("", name) = skipelem("////", name) = 0

filealloc()

filedup()

fileclose()

fileread()

filewrite()

| 9.3 | LEC 16 note

xv6 日志系统的问题：

synchronize: 必须等当前事务完全结束，才能继续。
write twice

ext3:

异步(async): 系统调用(syscall)返回并不代表需要的事务成功完成，所以用户进程必须考虑到这种情况(fsync(fd))
批处理(batching)
- 一次只有一个 “open transaction”，可以包含多个 syscall
并发(concurrency)
- 当前 “open transaction” 的所有系统调用结束后才能结束(close)该事务。
- 结束事务后会缓存到内存中，按顺序依次提交(commit)到磁盘的 log 区域。
- 因此 ext3 不像 xv6，必须等当前事务写入磁盘日志区，提交，install，清空日志区完成后再开始。

ext3 的三种模式：

journaled data: 和 xv6 一样，先写入磁盘日志区，然后再写原本的位置(home location)
ordered data: 只把元数据(metadata)写入磁盘日志区，而文件数据直接写入原本的位置，但是必须注意要先写数据，再写元数据。
writeback

| 9.4 | Lab: file system

| 9.4.1 | Large files (moderate)

📌 这部分代码在这里

| 9.4.2 | Symbolic links (moderate)

📌 这部分代码在这里

| 10 | Lab mmap 相关

| 10.1 | LEC 17 note

一些通过使用内核提供的虚拟内存原语/特性(primitives) 如: trap, prot1,protn, dirty, map2 (sigaction(), mprotect(), mmap())的算法。

VMA(Virtual Memory Area)

记录并描述一段连续的内存区域，包括起始终止地址，权限，相关文件指针等

| 10.2 | LEC 18 note

micro kernel 暂时忽略。

| 10.3 | LEC 19 note

Virtual Machine!!!!

虚拟化需要解决的三个问题：怎样安全的有效的使得虚拟机(guest VM), 执行指令，访问内存，读写设备。

虚拟机不能运行在宿主(host)内核中。

对应三种虚拟化 CPU 虚拟化，内存虚拟化，I/O 虚拟化

辅助参考:🔗 CTF Wiki-Pwn-Virtualization-基础知识

| 10.3.1 | trap-and-emulate

VMM 完全由软件模拟虚拟机(VM)的指令，也就是完全软件虚拟化 CPU是可行的，并且移植性很好，但是最大的问题是完全由软件模拟会很慢。

当 VM 执行特权指令()时，VMM 从 trap 中接手，模拟执行相应特权指令。

比如虚拟机 ecall，但是会陷入(trap)到宿主机内核(host kernel)，然后，VMM 接手，把相关的特权寄存器(scause, sepc, stvac…)复制到 VMM 针对虚拟机模拟的相应特权寄存器中

然后设置宿主 sepc 为模拟的 stvac , 设置虚拟机模式(mode bit)为 supervis mode , 然后宿主机执行 sret 返回到虚拟机内核(guest kernel)。

这样虚拟机的内核才能处理来自虚拟机用户态(guest userspace)的 trap。

当虚拟机内核处理结束执行 sret 的时候，同样会陷入宿主机的内核(trap into host kernel), (因为虚拟机内核实际上是运行在宿主机的用户态)然后宿主机内核交给 VMM 处理。

VMM 模拟相应指令(把模拟 sepc 拷贝到 pc，修改虚拟机为用户模式)回到虚拟机用户态。整个过程就像 guest userspace-> guest kernel space -> guest userspace 一样。

另一个问题：页表(pagetable):

VMM 不能让虚拟机设置真正的 satp，这样会使得虚拟机可以访问任意物理内存。同时 VMM 需要把真实的物理地址(Host Physical Address)针对每一个虚拟机，抽象成虚拟机的物理地址(Guest Physical Address)。
VMM 因此维护一个对应映射表(shallow pagetable)用于设置真正的 satp 使用 MMU 翻译地址, shallow pagetable的内容是 gva:hpa。

最后是设备问题:

VMM 模拟设备。
VMM 虚拟设备，并提供接口给虚拟机。因而虚拟机的设备驱动可以与 VMM 内支持的设备进行高效交互，而不是需要陷入 VMM。
pass-through 例如网卡。

| 10.3.2 | hardware support

intel VT-x

VMCS
EPT

| 10.3.3 | Dune: Safe User-level Access to Privileged CPU Features

一个 Linux 模块(LKM)，给普通程序提供使用 intel VT-x 的接口，实现或者优化沙盒(sandbox)，用户内存管理(memory management)，garbage collector 等。

| 10.4 | LEC 20 note

| 10.5 | Lab: mmap

MAP_SHARED 的问题:

对于两个进程同时 MAP_SHARED 相同文件，之后进程 1 修改文件，该文件在 munmap() 时写进文件系统，并且进程 2 由于在更新前调用的 mmap()，所以不能看见更新，除非进程 1 调用 msync() 并且指定 MS_INVALIDATE 标记进程 2 相应地址无效，使得进程 2 重新读文件。
所以对于 MAP_SHARED 的修改，这里直接写回原来的位置即可，不用管是否其他进程已经 mmap() 相同文件。
不过在 Linux 上测试时发现，进程 2 能够及时发现改变，应该是 Linux 在实现上两个进程用的同一个缓冲区的缘故。

MAP_PRIVATE 的问题：

即使 fd 对应的文件不可写，但是如果设置了 MAP_PRIVATE 仍然可以。

addr 为 0 的问题:

此时意味着内核必须选择一个”合适”的地址。
xv6 的 sbrk() 会一直增加虚拟地址空间，直到没有内存可以分配。
并且 mmap() 只是把文件映射到进程的虚拟地址空间，不应该增加 proc->sz
因而 mmap() 选择的地址应该和 sbrk() 分开。

length 和 off 的问题：

length 不用限制，如果超出文件的大小，也没事，因为会根据 length 分配物理内存。
off 大于文件的大小则应该返回错误。
readi(), 和 writei() 都应该基于 off。

munmp()问题：

munmap() 之后，访问 [addr, addr+length) 是非法的。
如果 length 不是页对其的，应该向上取整，因为之前 mmap() 的时候不足一页的会补齐。
如果 mmap() 了连续的 3 页，但是只 munmap() 第二页，其他两页应该继续有效，这是需要再找一个 VMA，但是 mmaptest 没有这种情况，所以暂时不考虑。
由于我们是延迟映射(lazy mmap) 所以在 munmap() 的时候，walk() 返回 0 或者 PTE_V 是 0 是很正常的。。

fork() 问题：

子进程可以看见父进程 MAP_PRIVATE 的改变。
所以子进程应该完全复制一份父进程的 VMA 空间(这里就不实现 COW 了)。
如果子进程在 uvmcopy() 第一个以后的 VMA 失败了(比如当父进程 mmap 了很大的文件，并且都分配了内存)，应该释放在前面分配的 VMA，否则就会损失这些 VMA，这个还没实现。

📌 这部分代码在这里

| 11 | Lab net 相关

后半学期有一门《网络协议栈分析》的课，留在后面吧。

| 12 | Meltdown & RCU

| 12.1 | LEC 22 note

intel CPU 特性，预测执行时不检查 PTE_U。同时即使非法指令被退回(retired)，缓存(cache)的内容仍然在，通过 Flush+Reload，得到缓存的内容进而判断内核虚拟地址的内容。

| 12.2 | LEC 23 note

RCU(Read-Copy-Update) 出现的原因：

对于某些共享数据结构，读的时间远大于写的时间，自旋锁(spinlock)互斥了多个读者(reader)同时访问的情况，这是可以优化的，比如通过读写锁(r/w lock)。
读写锁同样存在问题：读者会修改引用计数(ref count)，导致只读的操作增加了写，同时 n 个 CPU 有 n 个读者同时访问，复杂度是 O(n^2)，因为后来的读者会不断的循环 CAS。

RCU 实现方式：

写者(writer)仍然需要申请锁。
当写者要更新数据时，复制一份原数据，修改完成后提交到原来的位置(单链表，树等)这个提交的过程 Robert 教授称之为 commiting write，这个提交点应该是原子的，所以 RCU 对于双链表不太友好，因为需要修改两个指针。在这个提交点前，新来的读者不会发现修改，提交之后数据被完整的更新完毕。

Memory Barriers：

RCU 依赖新的项(entry)加入操作的代码执行顺序，必须先更新好新的 entry 再加入，但是编译器和 CPU 很可能重排这些操作，所以需要 barriers 来保证。
写者需要在提交点(commiting write)前添加 barrier。
理论上，读者也需要先保存读取节点(E2)前一个节点(E1)的地址即 r1 = E1->next，再访问 r1，因为可能存在 E1->next 对应地址的一个旧版本。

写者什么时候释放旧的项(entry)：

很明显不能使用引用计数(ref count)，这违背了 RCU 的原则。
如果有 Garbage Collector，也不用担心。
读者在 RCU 区域期间，不能被 trap 即上下文切换(context switch)。
写者等所有 CPU 都执行一次上下文切换后，释放。