Thursday, December 20, 2012

Syscall - part 2


Some argue that on x86 a syscall implies a context switch... Actually the kernel code runs in two possible contexts:

  • in kernel space, in process context (on behalf of a specific process)
  • in kernel space, in interrupt context (not bound to any process)

The CPU executes either userland code in user space, or kernel code in one of the two above-mentioned contexts.

By context switch we usually mean changing the current process: it is what happens when the current PID changes (for example, when the scheduler preempts a process). In the following lines I'd like to stress that a syscall happens without this context switch.

__kernel_vsyscall() looks like this:

ffffe400 <__kernel_vsyscall>:
ffffe400:       51                      push   %ecx
ffffe401:       52                      push   %edx
ffffe402:       55                      push   %ebp
ffffe403:       89 e5                   mov    %esp,%ebp
ffffe405:       0f 34                   sysenter 
ffffe407:       90                      nop    
ffffe408:       90                      nop    
ffffe409:       90                      nop    
ffffe40a:       90                      nop    
ffffe40b:       90                      nop    
ffffe40c:       90                      nop    
ffffe40d:       90                      nop    
ffffe40e:       eb f3                   jmp    ffffe403 <__kernel_vsyscall+0x3>
ffffe410:       5d                      pop    %ebp
ffffe411:       5a                      pop    %edx
ffffe412:       59                      pop    %ecx
ffffe413:       c3                      ret

esp = stack pointer
eip = instruction pointer

Explanation:
- on entry to __kernel_vsyscall, registers %ecx, %edx and %ebp are saved on the user stack and %esp is copied to %ebp before executing sysenter (this %ebp later helps the kernel restore the userland stack)

- jmp __kernel_vsyscall+0x3 is just a trick that makes it possible to work with 6 arguments instead of 3; the standard max number of args for a syscall is 6:
- sysenter may be executed twice (the second sysenter has no side effect: the syscall is just "restarted") - see https://lkml.org/lkml/2002/12/18/218 (Linus is a "disgusting pig") :)

- sysenter is executed; this brings the CPU into Ring0 (a.k.a. CPL=0)

- sysenter (fast system call facility on x86) does the following:
- CS register set to the value of (SYSENTER_CS_MSR)
- EIP register set to the value of (SYSENTER_EIP_MSR)
- SS register set to the sum of (8 plus the value in SYSENTER_CS_MSR)
- ESP register set to the value of (SYSENTER_ESP_MSR)
- Intel defines these MSRs (model-specific registers):
SYSENTER_CS_MSR=0x174
SYSENTER_ESP_MSR=0x175
SYSENTER_EIP_MSR=0x176
- these values are defined in linux in /usr/src/linux/include/asm/msr.h:
#define MSR_IA32_SYSENTER_CS            0x174
#define MSR_IA32_SYSENTER_ESP           0x175
#define MSR_IA32_SYSENTER_EIP           0x176

- at bootup, Linux writes these values into the MSRs (/usr/src/linux/arch/i386/kernel/sysenter.c):
wrmsr(MSR_IA32_SYSENTER_CS, __KERNEL_CS, 0);
wrmsr(MSR_IA32_SYSENTER_ESP, tss->esp1, 0);
wrmsr(MSR_IA32_SYSENTER_EIP, (unsigned long) sysenter_entry, 0);

- wrmsr writes the contents of registers EDX:EAX into the 64-bit model specific register (MSR) specified in the ECX register

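- 'tss' refers to the Task State Segment (TSS); tss->esp1 is the kernel-mode stack pointer that sysenter loads into ESP. It is only a temporary stack: the first instruction of sysenter_entry (shown further down) replaces it with the esp0 value stored in the TSS, i.e. the real kernel stack of the current process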

- Although Linux doesn't use hardware context switches, it is nonetheless forced to set up a TSS for each distinct CPU in the system. This is done for two main reasons:
- When an x86 CPU switches from User Mode to Kernel Mode, it fetches the address of the Kernel Mode stack from the TSS
- When a User Mode process attempts to access an I/O port by means of an in or out instruction, the CPU may need to access an I/O Permission Bitmap stored in the TSS to verify whether the process is allowed to address the port.

- So during initialization the kernel sets up these registers so that, after the SYSENTER instruction, ESP points to a kernel-mode stack and EIP is set to sysenter_entry


- now the kernel executes the code shown below. We are in Ring0 and we are executing kernel code, but the current PID is still the old PID:
- the calling user thread is still at the sysenter line
- a context switch, if one is needed, is done before returning to user space
- in Linux, context switching is done in software, not in hardware
- Linux does not keep a TSS per process: the hardware context of each process is saved in its process descriptor, and the kernel stack pointer (esp0) in the per-CPU TSS is updated at every process switch

- When a transition between user mode and kernel mode is required in an operating system, a context switch is not necessary; a mode transition is not by itself a context switch. 

However, depending on the operating system, a context switch may also take place at this time. (http://en.wikipedia.org/wiki/Context_switch#User_and_kernel_mode_switching)

ENTRY(sysenter_entry)
        movl TSS_sysenter_esp0(%esp),%esp
sysenter_past_esp:
        sti
        pushl $(__USER_DS)
        pushl %ebp                      /* %ebp contains userland %esp */
        pushfl
        pushl $(__USER_CS)
        pushl $SYSENTER_RETURN          /* userland return addr */

...
        pushl %eax
        SAVE_ALL                        /* pushes registers on to the stack */
        GET_THREAD_INFO(%ebp)

        /* Note, _TIF_SECCOMP is bit number 8, and so it needs testw and not testb */
        testw $(_TIF_SYSCALL_EMU|_TIF_SYSCALL_TRACE|_TIF_SECCOMP|_TIF_SYSCALL_AUDIT), TI_flags(%ebp)
        jnz syscall_trace_entry
        cmpl $(nr_syscalls), %eax
        jae syscall_badsys

        call *sys_call_table(,%eax,4)

        movl %eax,EAX(%esp)

#define SAVE_ALL \
cld; \
pushl %es; \
pushl %ds; \
pushl %eax; \
pushl %ebp; \
pushl %edi; \
pushl %esi; \
pushl %edx; \
pushl %ecx; \
pushl %ebx; \
movl $(__USER_DS), %edx; \
movl %edx, %ds; \
movl %edx, %es;

In conclusion, what happens is not a context switch, but a mode transition: the current PID stays the same. And, btw, user preemption can only happen in one of these 2 situations:

  • After a syscall finishes and the need_resched flag is set
  • After an interrupt finishes and the need_resched flag is set

The need_resched flag is set by the scheduler tick when a thread needs to be preempted, or by try_to_wake_up() when a higher-priority process is awakened.
