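To save physical memory and cut the time and resources spent on process creation, a child created by fork() initially shares the parent's physical pages instead of receiving its own copies. Only when one of the two processes writes to a shared page does the kernel allocate a separate page for the writer. This is what copy on write (COW) means. So how does the Linux kernel implement it? This post gives a brief walkthrough.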
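The visible effect of this mechanism can be checked from user space. The following is a minimal sketch, assuming a POSIX system; the function name parent_value_after_child_write is made up for illustration. It only demonstrates that a write in the child does not leak into the parent. The lazy copy itself is not observable from such a program.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Fork a child that overwrites a variable, then return the value the
 * parent still sees. Hypothetical helper, for illustration only.
 */
int parent_value_after_child_write(void)
{
    int value = 1;                  /* lives in a private, writable page */
    pid_t pid = fork();

    if (pid < 0) {
        perror("fork");
        exit(EXIT_FAILURE);
    }
    if (pid == 0) {
        value = 2;                  /* lands in the child's own copy of the page */
        _exit(value == 2 ? 0 : 1);
    }

    int status = 0;
    if (waitpid(pid, &status, 0) < 0) {
        perror("waitpid");
        exit(EXIT_FAILURE);
    }
    assert(WIFEXITED(status) && WEXITSTATUS(status) == 0);
    return value;                   /* the parent's copy was never touched */
}
```

Calling the function from main() and printing the result should show that the parent keeps its original value even though the child changed its own.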

The fork system call flow

sys_fork -> _do_fork(SIGCHLD, 0, 0, NULL, NULL, 0) -> copy_process() -> copy_mm(clone_flags, p)

static int copy_mm(unsigned long clone_flags, struct task_struct *tsk)
{
    struct mm_struct *mm, *oldmm;
    ...
    /* Thread creation: share the parent's mm and take a reference on it. */
    if (clone_flags & CLONE_VM) {
        mmget(oldmm);
        mm = oldmm;
        goto good_mm;
    }
    /* Process creation: build a separate mm for the child. */
    retval = -ENOMEM;
    mm = dup_mm(tsk);
    ...
good_mm:
    tsk->mm = mm;
    tsk->active_mm = mm;
    return 0;
fail_nomem:
    return retval;
}

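The logic of copy_mm is as follows. If clone_flags contains CLONE_VM, which is the case when clone() is used to create a thread, the new task shares the parent's mm structure. Otherwise a new process is being created and dup_mm is called, which in turn calls dup_mmap.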
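The share-or-duplicate decision can be modelled in a few lines of user-space C. This is a toy sketch, not kernel code: struct toy_mm, struct toy_task, TOY_CLONE_VM and the toy_* functions are all hypothetical names, and the user counter only stands in for the reference that mmget() takes.

```c
#include <assert.h>
#include <stdlib.h>

#define TOY_CLONE_VM 0x1u           /* hypothetical flag value for this model */

struct toy_mm {
    int users;                      /* number of tasks sharing this address space */
    int nr_vmas;                    /* stand-in for the real contents of an mm */
};

struct toy_task {
    struct toy_mm *mm;
};

/* Model of dup_mm(): allocate a new mm and copy the parent's contents. */
static struct toy_mm *toy_dup_mm(const struct toy_mm *old)
{
    struct toy_mm *mm = malloc(sizeof(*mm));

    if (!mm)
        return NULL;
    *mm = *old;
    mm->users = 1;                  /* the copy starts with a single user */
    return mm;
}

/* Model of copy_mm(): share for threads, duplicate for processes. */
int toy_copy_mm(unsigned flags, struct toy_task *child, struct toy_task *parent)
{
    struct toy_mm *mm;

    if (flags & TOY_CLONE_VM) {
        parent->mm->users++;        /* analogous to taking a reference */
        mm = parent->mm;
    } else {
        mm = toy_dup_mm(parent->mm);
        if (!mm)
            return -1;              /* analogous to -ENOMEM */
    }
    child->mm = mm;
    return 0;
}
```

With the flag set, child and parent end up pointing at the same toy_mm; without it, the child gets a separate object with the same contents.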

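dup_mm first allocates a new mm_struct for the child and then calls dup_mmap to copy the parent's address space, so let's see what dup_mmap actually copies. dup_mmap is too long to reproduce here, so we go straight to copy_page_range, the function responsible for copying the page tables. Linux has used a four-level paging model since 2.6.11, with the levels pgd, pud, pmd and pte (newer kernels add a p4d level between pgd and pud), so the calls from copy_page_range down to copy_pte_range copy the page table entries level by level. Finally, let's look at copy_one_pte, which is called by copy_pte_range: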
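Before reading copy_one_pte, it may help to see how one virtual address selects an entry at each of the four levels. The sketch below assumes x86-64 with 4 KiB pages and four-level paging (9 index bits per level, 12 offset bits); the TOY_* macros, struct pt_indices and split_vaddr are illustrative names, not kernel APIs.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_PAGE_SHIFT 12                   /* 4 KiB pages */
#define TOY_INDEX_BITS 9                    /* 512 entries per table */
#define TOY_INDEX_MASK ((1u << TOY_INDEX_BITS) - 1)

struct pt_indices {
    unsigned pgd, pud, pmd, pte;            /* index into each table level */
    unsigned offset;                        /* byte offset inside the page */
};

/* Split a virtual address into its four table indices and page offset. */
struct pt_indices split_vaddr(uint64_t va)
{
    struct pt_indices ix;

    ix.offset = (unsigned)(va & ((1u << TOY_PAGE_SHIFT) - 1));
    ix.pte = (unsigned)((va >> TOY_PAGE_SHIFT) & TOY_INDEX_MASK);
    ix.pmd = (unsigned)((va >> (TOY_PAGE_SHIFT + 1 * TOY_INDEX_BITS)) & TOY_INDEX_MASK);
    ix.pud = (unsigned)((va >> (TOY_PAGE_SHIFT + 2 * TOY_INDEX_BITS)) & TOY_INDEX_MASK);
    ix.pgd = (unsigned)((va >> (TOY_PAGE_SHIFT + 3 * TOY_INDEX_BITS)) & TOY_INDEX_MASK);
    return ix;
}
```

copy_page_range and the functions below it walk these same levels for every address in the range being copied, and the last level is handled by copy_one_pte, whose kernel source follows.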

static inline void
copy_one_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,
             pte_t *dst_pte, pte_t *src_pte, struct vm_area_struct *vma,
             unsigned long addr, int *rss)
{
    unsigned long vm_flags = vma->vm_flags;
    pte_t pte = *src_pte;
    struct page *page;
    ...
    /*
     * If it's a COW mapping, write protect it both
     * in the parent and the child
     */
    if (is_cow_mapping(vm_flags)) {
        ptep_set_wrprotect(src_mm, addr, src_pte);
        pte = pte_wrprotect(pte);
    }
    ...

out_set_pte:
    set_pte_at(dst_mm, addr, dst_pte, pte);
}

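The code above checks whether the parent's mapping is a copy-on-write mapping. If it is, the page is write-protected in both the parent and the child by clearing the writable bit of the pte (_PAGE_RW, bit number _PAGE_BIT_RW on x86).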
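The write-protect step can be tried out in user space with plain integers. This is a sketch under stated assumptions: the TOY_* constants mirror the x86 values of VM_SHARED, VM_MAYWRITE and _PAGE_RW as I understand them, the COW test follows the kernel's definition of is_cow_mapping (private and potentially writable), and all toy_* names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed to match the kernel's values on x86; treat them as illustrative. */
#define TOY_VM_SHARED   0x08ul
#define TOY_VM_MAYWRITE 0x20ul
#define TOY_PAGE_RW     (1ull << 1)         /* writable bit of an x86 pte */

/* A mapping is COW when it is private but may become writable. */
static bool toy_is_cow_mapping(unsigned long vm_flags)
{
    return (vm_flags & (TOY_VM_SHARED | TOY_VM_MAYWRITE)) == TOY_VM_MAYWRITE;
}

/* Clear the writable bit, leaving every other bit untouched. */
static uint64_t toy_pte_wrprotect(uint64_t pte)
{
    return pte & ~TOY_PAGE_RW;
}

/*
 * Model of the COW step in copy_one_pte(): write-protect the parent's
 * entry in place and return the (equally protected) entry for the child.
 */
uint64_t toy_copy_one_pte(uint64_t *src_pte, unsigned long vm_flags)
{
    uint64_t pte = *src_pte;

    if (toy_is_cow_mapping(vm_flags)) {
        *src_pte = toy_pte_wrprotect(*src_pte);
        pte = toy_pte_wrprotect(pte);
    }
    return pte;
}
```

For a private writable mapping both entries lose the writable bit; for a shared mapping the entry is copied unchanged, because writes to it are meant to be visible to both processes.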

That completes the fork system call. So when the parent or the child later tries to write to a shared physical page, how does the kernel copy it?

Writing to a shared physical page

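When either parent process A or child process B writes to one of these shared physical pages, the CPU raises a page fault exception (vector 14 on x86). The fault handler sets FAULT_FLAG_WRITE in flags and then resolves the exception through the call chain do_page_fault() -> handle_mm_fault() -> handle_pte_fault().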

static vm_fault_t handle_pte_fault(struct vm_fault *vmf)
{
    ...
    if (vmf->flags & FAULT_FLAG_WRITE) {
        if (!pte_write(entry))
            return do_wp_page(vmf);
        entry = pte_mkdirty(entry);
    }
    ...
}

pte_write tests pte_flags(pte) & _PAGE_RW to decide whether the page is writable. That bit was cleared during fork, so the fault handler goes on to call do_wp_page.
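The path from the hardware fault to that decision can be sketched in user space. The model below assumes the x86 page fault error code layout (bit 0: protection violation on a present page, bit 1: write access, bit 2: user mode) and the x86 pte bits for writable (bit 1) and dirty (bit 6). The TOY_* constants, the enum and both functions are hypothetical names; they are not the kernel's own definitions.

```c
#include <assert.h>
#include <stdint.h>

/* x86 page fault error code bits (assumed layout, see lead-in). */
#define TOY_PF_PROT   (1u << 0)
#define TOY_PF_WRITE  (1u << 1)
#define TOY_PF_USER   (1u << 2)

#define TOY_FAULT_FLAG_WRITE 0x1u           /* hypothetical value */
#define TOY_PAGE_RW    (1ull << 1)
#define TOY_PAGE_DIRTY (1ull << 6)

enum toy_fault_action { TOY_FAULT_DONE, TOY_FAULT_DO_WP_PAGE };

/* Translate the hardware error code into software fault flags. */
unsigned toy_fault_flags(unsigned error_code)
{
    unsigned flags = 0;

    if (error_code & TOY_PF_WRITE)
        flags |= TOY_FAULT_FLAG_WRITE;
    return flags;
}

/* Model of the write branch of handle_pte_fault(). */
enum toy_fault_action toy_handle_pte_fault(uint64_t *pte, unsigned flags)
{
    if (flags & TOY_FAULT_FLAG_WRITE) {
        if (!(*pte & TOY_PAGE_RW))
            return TOY_FAULT_DO_WP_PAGE;    /* write to a protected page */
        *pte |= TOY_PAGE_DIRTY;             /* ordinary write: mark dirty */
    }
    return TOY_FAULT_DONE;
}
```

A write to a pte whose writable bit was cleared at fork time takes the TOY_FAULT_DO_WP_PAGE branch, which corresponds to the call into the kernel's do_wp_page, shown next.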

/*
 * This routine handles present pages, when users try to write
 * to a shared page. It is done by copying the page to a new address
 * and decrementing the shared-page counter for the old page.
 * ...
 */
static vm_fault_t do_wp_page(struct vm_fault *vmf)
    __releases(vmf->ptl)
{
    ...
    return wp_page_copy(vmf);
}

/*
 * Handle the case of a page which we actually need to copy to a new page.
 *
 * Called with mmap_sem locked and the old page referenced, but
 * without the ptl held.
 *
 * High level logic flow:
 *
 * - Allocate a page, copy the content of the old page to the new one.
 * - Handle book keeping and accounting - cgroups, mmu-notifiers, etc.
 * - Take the PTL. If the pte changed, bail out and release the allocated page
 * - If the pte is still the way we remember it, update the page table and all
 *   relevant references. This includes dropping the reference the page-table
 *   held to the old page, as well as updating the rmap.
 * - In any case, unlock the PTL and drop the reference we took to the old page.
 */
static vm_fault_t wp_page_copy(struct vm_fault *vmf)

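We won't analyze wp_page_copy in detail. In essence it allocates a new page, copies the contents of the old page into it, and updates the faulting process's page table to point at the new page.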

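At this point the parent and the child each have their own physical page with identical contents. Finally, on return from the exception handler, the CPU re-executes the write instruction that caused the fault, and the process continues running.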
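The whole sequence (share at fork, fault on write, copy, retry) can be simulated with a small user-space model. This is a deliberately simplified sketch: struct toy_page, struct toy_pte and the toy_* functions are hypothetical, a single mapcount replaces the kernel's reference counting and rmap, and the "reuse in place when nobody else maps the page" shortcut is a simplification whose exact conditions in the real kernel vary by version.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define TOY_PAGE_SIZE 4096

struct toy_page {
    int mapcount;                           /* how many ptes map this page */
    unsigned char data[TOY_PAGE_SIZE];
};

struct toy_pte {
    struct toy_page *page;
    int writable;
};

/* Fork time: share the page and write-protect both entries. */
void toy_fork_pte(struct toy_pte *parent, struct toy_pte *child)
{
    parent->writable = 0;
    *child = *parent;
    parent->page->mapcount++;
}

/* Write fault on a protected entry. Returns 0 on success, -1 on OOM. */
int toy_write_fault(struct toy_pte *pte)
{
    struct toy_page *old = pte->page;
    struct toy_page *copy;

    if (old->mapcount == 1) {               /* sole user: no copy needed */
        pte->writable = 1;
        return 0;
    }
    copy = malloc(sizeof(*copy));           /* allocate a new page */
    if (!copy)
        return -1;
    memcpy(copy->data, old->data, TOY_PAGE_SIZE);
    copy->mapcount = 1;
    old->mapcount--;                        /* drop this pte's use of the old page */
    pte->page = copy;                       /* update the "page table" */
    pte->writable = 1;
    return 0;
}

/* A store: fault first if the entry is protected, then retry the write. */
int toy_store(struct toy_pte *pte, size_t off, unsigned char value)
{
    if (!pte->writable && toy_write_fault(pte) != 0)
        return -1;
    pte->page->data[off] = value;
    return 0;
}
```

In this model the first writer after toy_fork_pte receives a private copy, while the other side keeps the original page and, once it is the only remaining user, regains write access without any copying.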