A Deep Dive into Linux: How Copy-on-Write (COW) Is Implemented

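To save physical memory and to cut the time and resources spent on process creation, a parent process that calls fork() shares its memory with the new child instead of copying it. Only when one of the two processes writes to a shared page does the kernel allocate a separate page for the writer. This is what copy-on-write (COW) means. So how does the Linux kernel implement it? This article walks briefly through the relevant code.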
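The effect described above can be observed from user space. The following is a minimal sketch, assuming a POSIX system; the function and variable names are made up for illustration. It only shows the visible result (the child's write does not reach the parent), not the page fault that the kernel handles underneath.

```c
#include <assert.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Lives in a private, writable data page, so it is eligible for COW. */
static volatile int cow_demo_value = 1;

/*
 * Fork a child that overwrites the variable, then report what the parent
 * still sees. Returns the parent's value after the child has exited,
 * or -1 if fork() or waitpid() failed.
 */
int parent_value_after_child_write(void)
{
    cow_demo_value = 1;

    pid_t pid = fork();
    if (pid < 0)
        return -1;

    if (pid == 0) {
        /* Child: the store lands in the child's own copy of the page. */
        cow_demo_value = 99;
        _exit(cow_demo_value == 99 ? 0 : 1);
    }

    int status = 0;
    if (waitpid(pid, &status, 0) != pid)
        return -1;

    /* The child must have observed its own write. */
    assert(WIFEXITED(status) && WEXITSTATUS(status) == 0);

    /* Parent: its view of the page was never modified. */
    return cow_demo_value;
}
```

Calling parent_value_after_child_write() from main() should return 1 in the parent even though the child stored 99 into the same variable.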

The fork system call path

sys_fork -> _do_fork(SIGCHLD, 0, 0, NULL, NULL, 0) -> copy_process() -> copy_mm(clone_flags, p)

static int copy_mm(unsigned long clone_flags, struct task_struct *tsk)
{
    struct mm_struct *mm, *oldmm;
    ...
    if (clone_flags & CLONE_VM) {
        mmget(oldmm);
        mm = oldmm;
        goto good_mm;
    }
    retval = -ENOMEM;
    mm = dup_mm(tsk);
    ...
good_mm:
    tsk->mm = mm;
    tsk->active_mm = mm;
    return 0;
fail_nomem:
    return retval;
}

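The logic of copy_mm is as follows. If clone_flags contains CLONE_VM, which is the case when clone() is used to create a thread, the new task shares the parent's mm structure. Otherwise a new process is being created, so copy_mm calls dup_mm, which in turn calls dup_mmap.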
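The share-or-duplicate decision can be modeled in a few lines of user-space C. This is a toy sketch, not kernel code: struct toy_mm, toy_copy_mm and TOY_CLONE_VM are hypothetical names, and the "duplicate" step is a plain struct copy standing in for everything dup_mm does.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in; it uses the same numeric value as CLONE_VM. */
#define TOY_CLONE_VM 0x100UL

struct toy_mm {
    int users;    /* how many tasks reference this address space */
    int nr_vmas;  /* placeholder for the contents of the mm */
};

/*
 * Mirror the branch in copy_mm(): share the parent's mm when CLONE_VM is
 * set, otherwise hand back a fresh duplicate.
 * Returns NULL if the duplicate cannot be allocated.
 */
struct toy_mm *toy_copy_mm(unsigned long clone_flags, struct toy_mm *oldmm)
{
    assert(oldmm != NULL);

    if (clone_flags & TOY_CLONE_VM) {
        oldmm->users++;              /* plays the role of mmget(oldmm) */
        return oldmm;
    }

    struct toy_mm *mm = malloc(sizeof(*mm));  /* plays the role of dup_mm() */
    if (mm == NULL)
        return NULL;
    *mm = *oldmm;
    mm->users = 1;                   /* the copy has a single owner */
    return mm;
}
```

With TOY_CLONE_VM set, the caller gets back the very same pointer and the user count goes up; without it, the caller gets an independent copy that must be released with free().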

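dup_mm first allocates a new mm_struct for the child and then calls dup_mmap to copy the parent's address space, so let's look at what dup_mmap copies. dup_mmap is too long to reproduce here, so we go straight to copy_page_range, the function responsible for copying page tables. Since 2.6.11 Linux has used a four-level paging model (pgd, pud, pmd, pte), so the calls from copy_page_range down to copy_pte_range each copy the entries of one level. Finally, let's look at copy_one_pte, which copy_pte_range calls.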
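To make the four levels concrete, here is a small sketch of how a virtual address splits into one index per level. It assumes the x86-64 layout with 4 KiB pages (9 index bits per level plus a 12-bit page offset); other architectures and configurations differ, and all toy_ names are invented for this example.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed layout: x86-64, 4 KiB pages, four levels, 512 entries per table. */
#define TOY_PAGE_SHIFT      12
#define TOY_PMD_SHIFT       21
#define TOY_PUD_SHIFT       30
#define TOY_PGD_SHIFT       39
#define TOY_PTRS_PER_TABLE  512U
#define TOY_PAGE_SIZE       (1U << TOY_PAGE_SHIFT)

struct toy_walk {
    unsigned int pgd, pud, pmd, pte;  /* index into each table level */
    unsigned int offset;              /* byte offset inside the page */
};

/* Split a virtual address into its per-level indices. */
struct toy_walk toy_split_address(uint64_t addr)
{
    struct toy_walk w;
    w.pgd    = (unsigned int)((addr >> TOY_PGD_SHIFT)  & (TOY_PTRS_PER_TABLE - 1));
    w.pud    = (unsigned int)((addr >> TOY_PUD_SHIFT)  & (TOY_PTRS_PER_TABLE - 1));
    w.pmd    = (unsigned int)((addr >> TOY_PMD_SHIFT)  & (TOY_PTRS_PER_TABLE - 1));
    w.pte    = (unsigned int)((addr >> TOY_PAGE_SHIFT) & (TOY_PTRS_PER_TABLE - 1));
    w.offset = (unsigned int)(addr & (TOY_PAGE_SIZE - 1));
    return w;
}

/* Inverse of toy_split_address(), useful for building test addresses. */
uint64_t toy_join_address(struct toy_walk w)
{
    assert(w.pgd < TOY_PTRS_PER_TABLE && w.pud < TOY_PTRS_PER_TABLE);
    assert(w.pmd < TOY_PTRS_PER_TABLE && w.pte < TOY_PTRS_PER_TABLE);
    assert(w.offset < TOY_PAGE_SIZE);
    return ((uint64_t)w.pgd << TOY_PGD_SHIFT) |
           ((uint64_t)w.pud << TOY_PUD_SHIFT) |
           ((uint64_t)w.pmd << TOY_PMD_SHIFT) |
           ((uint64_t)w.pte << TOY_PAGE_SHIFT) |
           (uint64_t)w.offset;
}
```

Each level of the copy walk visits the table selected by the index of the level above it, which is why the copying bottoms out at individual PTEs. With that layout in mind, the copy_one_pte excerpt follows.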

static inline void
copy_one_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,
        pte_t *dst_pte, pte_t *src_pte, struct vm_area_struct *vma,
        unsigned long addr, int *rss)
{
    unsigned long vm_flags = vma->vm_flags;
    pte_t pte = *src_pte;
    struct page *page;
    ...
    /*
     * If it's a COW mapping, write protect it both
     * in the parent and the child
     */
    if (is_cow_mapping(vm_flags)) {
        ptep_set_wrprotect(src_mm, addr, src_pte);
        pte = pte_wrprotect(pte);
    }
    ...

out_set_pte:
    set_pte_at(dst_mm, addr, dst_pte, pte);
}

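The code above checks whether the parent's mapping is a copy-on-write mapping. If it is, the page is write-protected in both the parent and the child by clearing the writable bit (_PAGE_BIT_RW) in the PTE.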

That completes the fork system call. So when the parent or the child later tries to write to a shared physical page, how does the kernel copy it?

Writing to a shared physical page

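When either parent process A or child process B writes to one of these shared physical pages, the CPU raises a page fault exception (vector 14 on x86). The fault handler sets FAULT_FLAG_WRITE in flags and then resolves the fault through the call chain do_page_fault() -> handle_mm_fault() -> handle_pte_fault().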

static vm_fault_t handle_pte_fault(struct vm_fault *vmf)
{
    ...
    if (vmf->flags & FAULT_FLAG_WRITE) {
        if (!pte_write(entry))
            return do_wp_page(vmf);
        entry = pte_mkdirty(entry);
    }
    ...
}

pte_write tests pte_flags(pte) & _PAGE_RW to determine whether the page is writable. That bit was cleared during fork, so the check fails and do_wp_page is called next.

/*
 * This routine handles present pages, when users try to write
 * to a shared page. It is done by copying the page to a new address
 * and decrementing the shared-page counter for the old page.
 * ...
 */
static vm_fault_t do_wp_page(struct vm_fault *vmf)
    __releases(vmf->ptl)
{
    ...
    return wp_page_copy(vmf);
}

/*
 * Handle the case of a page which we actually need to copy to a new page.
 *
 * Called with mmap_sem locked and the old page referenced, but
 * without the ptl held.
 *
 * High level logic flow:
 *
 * - Allocate a page, copy the content of the old page to the new one.
 * - Handle book keeping and accounting - cgroups, mmu-notifiers, etc.
 * - Take the PTL. If the pte changed, bail out and release the allocated page
 * - If the pte is still the way we remember it, update the page table and all
 *   relevant references. This includes dropping the reference the page-table
 *   held to the old page, as well as updating the rmap.
 * - In any case, unlock the PTL and drop the reference we took to the old page.
 */
static vm_fault_t wp_page_copy(struct vm_fault *vmf)

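We won't analyze wp_page_copy in detail. In essence it allocates a new page, copies the contents of the old page into it and, as its header comment describes, updates the page table to point at the new page.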
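The core of that flow (allocate, copy, repoint the PTE, drop the old reference) can be sketched as a user-space toy. This is a simplified model with hypothetical types and names; it leaves out the locking, the PTE recheck under the PTL, rmap and accounting that the kernel comment lists.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

#define TOY_PAGE_SIZE 4096

struct toy_page {
    int mapcount;                       /* how many PTEs point at this page */
    unsigned char data[TOY_PAGE_SIZE];
};

struct toy_pte {
    struct toy_page *page;
    bool writable;
};

/*
 * Resolve a write fault on a write-protected toy PTE by copying the page.
 * Returns 0 on success, -1 if no memory is available.
 */
int toy_wp_page_copy(struct toy_pte *pte)
{
    assert(pte != NULL && pte->page != NULL && !pte->writable);
    struct toy_page *old_page = pte->page;

    struct toy_page *new_page = malloc(sizeof(*new_page));
    if (new_page == NULL)
        return -1;

    /* Copy the old contents so the writer sees the same data as before. */
    memcpy(new_page->data, old_page->data, TOY_PAGE_SIZE);
    new_page->mapcount = 1;

    pte->page = new_page;   /* point the faulting PTE at the private copy */
    pte->writable = true;   /* the retried write may now proceed          */
    old_page->mapcount--;   /* the old page lost one mapping              */
    return 0;
}
```

After toy_wp_page_copy() succeeds, the faulting side owns a private, writable copy, while the other side still maps the original page untouched.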

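At this point the parent and the child each have their own physical page with identical contents. Finally, on return from the exception handler, the CPU re-executes the write instruction that caused the fault, and the process continues running.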
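The "fault, fix, retry" behavior has a user-space analogue that can be tried directly. The sketch below assumes Linux, where a store to a read-only mapping delivers SIGSEGV with the faulting address in si_addr. It is an analogy, not the kernel's COW path: the signal handler plays the role of the fault handler, and calling mprotect() from a handler is common practice but not guaranteed by POSIX. All toy_ names are invented for this example.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <signal.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

static void *toy_page_addr;                   /* page the handler may unprotect */
static size_t toy_page_len;
static volatile sig_atomic_t toy_fault_count;

/* On a fault inside our page, make it writable and return, so that the CPU
 * retries the faulting store. Any other fault is treated as a real crash. */
static void toy_on_segv(int sig, siginfo_t *info, void *ctx)
{
    (void)sig;
    (void)ctx;
    uintptr_t addr = (uintptr_t)info->si_addr;
    uintptr_t base = (uintptr_t)toy_page_addr;

    if (addr < base || addr >= base + toy_page_len)
        _exit(2);
    toy_fault_count++;
    if (mprotect(toy_page_addr, toy_page_len, PROT_READ | PROT_WRITE) != 0)
        _exit(3);
}

/* Store value into a read-only page and read it back.
 * Returns the value read back, or -1 if the setup failed. */
int toy_write_through_fault(int value)
{
    long page_size = sysconf(_SC_PAGESIZE);
    if (page_size <= 0)
        return -1;
    toy_page_len = (size_t)page_size;
    toy_page_addr = mmap(NULL, toy_page_len, PROT_READ,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (toy_page_addr == MAP_FAILED)
        return -1;

    struct sigaction sa = {0};
    struct sigaction old_sa;
    sa.sa_sigaction = toy_on_segv;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGSEGV, &sa, &old_sa) != 0) {
        munmap(toy_page_addr, toy_page_len);
        return -1;
    }

    toy_fault_count = 0;
    volatile int *p = toy_page_addr;
    *p = value;          /* faults once, is retried after the handler returns */
    int seen = *p;

    sigaction(SIGSEGV, &old_sa, NULL);
    munmap(toy_page_addr, toy_page_len);

    assert(toy_fault_count == 1);   /* exactly one fault was needed */
    return seen;
}
```

The store itself appears only once in the source. It succeeds on the second attempt because returning from the handler resumes execution at the instruction that faulted, which is the same mechanism that lets a COW write complete transparently.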