Some argue that on x86 a syscall implies a context switch. In fact, kernel code runs in one of two possible contexts:
- in kernel-space mode, in process context (on behalf of a specific process)
- in kernel-space, in interrupt context (not bound to any process)
The CPU executes either userland code in user space, or kernel code in one of the two above-mentioned contexts.
By context switch we usually mean changing the current process: it is what happens when the current PID changes, for example when the scheduler preempts a process. In the following lines I'd like to stress that a syscall happens without this kind of context switch.
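One way to check this claim from user space is to count context switches around a burst of system calls. The sketch below is only an illustration, not part of the original analysis: it assumes a Linux system where getrusage() fills in ru_nvcsw and ru_nivcsw, and the helper names context_switches() and switches_during_syscalls() are made up for this example.

```c
#define _GNU_SOURCE             /* for syscall() on strict compilers */
#include <assert.h>
#include <sys/resource.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Total context switches (voluntary + involuntary) charged to this
 * process so far, or -1 if getrusage() fails. */
static long context_switches(void)
{
    struct rusage ru;

    if (getrusage(RUSAGE_SELF, &ru) != 0)
        return -1;
    return ru.ru_nvcsw + ru.ru_nivcsw;
}

/* Issue n real getppid system calls and return how many context
 * switches were recorded while they ran, or -1 on error. */
static long switches_during_syscalls(long n)
{
    long before, after;

    assert(n >= 0);
    before = context_switches();
    for (long i = 0; i < n; i++)
        syscall(SYS_getppid);   /* raw syscall: one mode transition each */
    after = context_switches();

    if (before < 0 || after < 0)
        return -1;
    return after - before;
}
```

If every syscall implied a context switch, switches_during_syscalls(100000) would return at least 100000. On an otherwise idle machine it should instead return a small number, coming only from the occasional preemption by the scheduler.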
__kernel_vsyscall() looks like this:
ffffe400 <__kernel_vsyscall>:
ffffe400: 51 push %ecx
ffffe401: 52 push %edx
ffffe402: 55 push %ebp
ffffe403: 89 e5 mov %esp,%ebp
ffffe405: 0f 34 sysenter
ffffe407: 90 nop
ffffe408: 90 nop
ffffe409: 90 nop
ffffe40a: 90 nop
ffffe40b: 90 nop
ffffe40c: 90 nop
ffffe40d: 90 nop
ffffe40e: eb f3 jmp ffffe403 <__kernel_vsyscall+0x3>
ffffe410: 5d pop %ebp
ffffe411: 5a pop %edx
ffffe412: 59 pop %ecx
ffffe413: c3 ret
esp = stack pointer
eip = instruction pointer
Explanation:
- on entry to this address, registers %ecx, %edx and %ebp are saved on the user stack and %esp is copied to %ebp before executing sysenter (the kernel later uses this %ebp to restore the userland stack)
- jmp __kernel_vsyscall+0x3 is just a trick that makes it possible to work with 6 arguments instead of 3 (6 is the standard maximum number of arguments for a syscall):
- sysenter is executed a second time (the second sysenter has no other impact: the syscall is just "restarted") - see https://lkml.org/lkml/2002/12/18/218 (Linus is a "disgusting pig") :)
- sysenter is executed; this brings the CPU into Ring 0 (a.k.a. CPL=0)
- sysenter (fast system call facility on x86) does the following:
- CS register set to the value of (SYSENTER_CS_MSR)
- EIP register set to the value of (SYSENTER_EIP_MSR)
- SS register set to the sum of (8 plus the value in SYSENTER_CS_MSR)
- ESP register set to the value of (SYSENTER_ESP_MSR)
- Intel defines these MSRs (model-specific registers):
SYSENTER_CS_MSR=0x174
SYSENTER_ESP_MSR=0x175
SYSENTER_EIP_MSR=0x176
- these values are defined in linux in /usr/src/linux/include/asm/msr.h:
#define MSR_IA32_SYSENTER_CS 0x174
#define MSR_IA32_SYSENTER_ESP 0x175
#define MSR_IA32_SYSENTER_EIP 0x176
- at boot, Linux programs these MSRs (/usr/src/linux/arch/i386/kernel/sysenter.c):
wrmsr(MSR_IA32_SYSENTER_CS, __KERNEL_CS, 0);
wrmsr(MSR_IA32_SYSENTER_ESP, tss->esp1, 0);
wrmsr(MSR_IA32_SYSENTER_EIP, (unsigned long) sysenter_entry, 0);
- wrmsr writes the contents of registers EDX:EAX into the 64-bit model specific register (MSR) specified in the ECX register
- 'tss' refers to the Task State Segment (TSS) and tss->esp1 thus points to the kernel mode stack
- Although Linux doesn't use hardware context switches, it is nonetheless forced to set up a TSS for each distinct CPU in the system. This is done for two main reasons:
- When an x86 CPU switches from User Mode to Kernel Mode, it fetches the address of the Kernel Mode stack from the TSS
- When a User Mode process attempts to access an I/O port by means of an in or out instruction, the CPU may need to access an I/O Permission Bitmap stored in the TSS to verify whether the process is allowed to address the port.
- so during initialization the kernel sets up these registers such that, after the SYSENTER instruction, ESP points to the kernel mode stack and EIP is set to sysenter_entry
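The register effects of wrmsr and sysenter described above can be sketched as a tiny software model. This is an illustration only: struct model_cpu, model_wrmsr() and model_sysenter() are hypothetical names, and the model ignores everything else the real instructions do (privilege checks, segment descriptor caches, flags).

```c
#include <assert.h>
#include <stdint.h>

/* MSR indices as listed in the text. */
enum {
    MSR_IA32_SYSENTER_CS  = 0x174,
    MSR_IA32_SYSENTER_ESP = 0x175,
    MSR_IA32_SYSENTER_EIP = 0x176
};

/* Hypothetical model: the three SYSENTER MSRs plus the four registers
 * that sysenter loads from them. */
struct model_cpu {
    uint64_t msr_cs, msr_esp, msr_eip;
    uint16_t cs, ss;
    uint32_t eip, esp;
};

/* wrmsr: ECX selects the MSR, EDX:EAX supplies the 64-bit value. */
static void model_wrmsr(struct model_cpu *cpu, uint32_t ecx,
                        uint32_t eax, uint32_t edx)
{
    uint64_t value = ((uint64_t)edx << 32) | eax;

    switch (ecx) {
    case MSR_IA32_SYSENTER_CS:  cpu->msr_cs  = value; break;
    case MSR_IA32_SYSENTER_ESP: cpu->msr_esp = value; break;
    case MSR_IA32_SYSENTER_EIP: cpu->msr_eip = value; break;
    default: assert(0 && "MSR not covered by this model");
    }
}

/* sysenter: load CS, EIP, SS and ESP from the MSRs. */
static void model_sysenter(struct model_cpu *cpu)
{
    cpu->cs  = (uint16_t)cpu->msr_cs;
    cpu->eip = (uint32_t)cpu->msr_eip;
    cpu->ss  = (uint16_t)(cpu->msr_cs + 8);   /* SS = SYSENTER_CS + 8 */
    cpu->esp = (uint32_t)cpu->msr_esp;
}
```

Nothing in this model touches a process identifier or a scheduler data structure, which is the point: sysenter only redirects the stack and instruction pointers of the thread that is already running.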
- now the kernel executes the following code. We are in Ring 0 and we are executing kernel code, but the current PID is still the old PID:
- the calling user thread is still at the sysenter instruction
- a context switch, if one is needed, happens only later, before returning to user space
- in Linux, context switching is done in software, not in hardware
- Linux does not create a TSS per process, however: it keeps the one TSS per CPU mentioned above and updates its kernel stack pointer at each context switch
- When a transition between user mode and kernel mode is required in an operating system, a context switch is not necessary; a mode transition is not by itself a context switch.
However, depending on the operating system, a context switch may also take place at this time. (http://en.wikipedia.org/wiki/Context_switch#User_and_kernel_mode_switching)
ENTRY(sysenter_entry)
movl TSS_sysenter_esp0(%esp),%esp
sysenter_past_esp:
sti
pushl $(__USER_DS)
pushl %ebp /* %ebp contains userland %esp */
pushfl
pushl $(__USER_CS)
pushl $SYSENTER_RETURN /* userland return addr */

...
pushl %eax
SAVE_ALL /* pushes registers on to stack */
GET_THREAD_INFO(%ebp)

/* Note, _TIF_SECCOMP is bit number 8, and so it needs testw and not testb */
testw $(_TIF_SYSCALL_EMU|_TIF_SYSCALL_TRACE|_TIF_SECCOMP|_TIF_SYSCALL_AUDIT), TI_flags(%ebp)
jnz syscall_trace_entry
cmpl $(nr_syscalls), %eax
jae syscall_badsys
call *sys_call_table(,%eax,4)
movl %eax,EAX(%esp)
#define SAVE_ALL \
cld; \
pushl %es; \
pushl %ds; \
pushl %eax; \
pushl %ebp; \
pushl %edi; \
pushl %esi; \
pushl %edx; \
pushl %ecx; \
pushl %ebx; \
movl $(__USER_DS), %edx; \
movl %edx, %ds; \
movl %edx, %es;
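The last instructions of sysenter_entry (cmpl / jae / call through sys_call_table) amount to a bounds-checked jump table. A rough C equivalent might look like the sketch below. The table contents, the handler names and the three-argument signature are invented for illustration. The only details taken from the kernel are the unsigned comparison against the number of syscalls and the -ENOSYS result for an out-of-range number.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical handler type: the real kernel handlers take their
 * arguments from the registers saved by SAVE_ALL. */
typedef long (*model_syscall_fn)(long a, long b, long c);

static long model_sys_sum(long a, long b, long c)   { return a + b + c; }
static long model_sys_first(long a, long b, long c) { (void)b; (void)c; return a; }

/* Stand-in for sys_call_table: indexed by the syscall number in %eax. */
static const model_syscall_fn model_sys_call_table[] = {
    model_sys_sum,      /* 0 */
    model_sys_first     /* 1 */
};

#define MODEL_NR_SYSCALLS \
    (sizeof(model_sys_call_table) / sizeof(model_sys_call_table[0]))

static long model_dispatch(unsigned long nr, long a, long b, long c)
{
    /* cmpl $(nr_syscalls), %eax ; jae syscall_badsys
     * jae is an unsigned comparison, so "negative" numbers fail too. */
    if (nr >= MODEL_NR_SYSCALLS)
        return -ENOSYS;

    assert(model_sys_call_table[nr] != NULL);

    /* call *sys_call_table(,%eax,4) */
    return model_sys_call_table[nr](a, b, c);
}
```

The dispatch is an ordinary indirect function call made by the thread that issued the syscall. The handler runs in process context, on behalf of the caller.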
In conclusion, what happens is not a context switch but a mode transition: the current PID stays the same. Incidentally, user preemption can only happen in one of two situations:
- after a syscall finishes and the need_resched flag is set
- after an interrupt finishes and the need_resched flag is set
The need_resched flag is set by the scheduler tick when a thread needs to be preempted, or by try_to_wake_up() when a higher-priority process is awakened.
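The role of need_resched can be pictured with a deliberately simplified model. It is a sketch under loose assumptions, not kernel code: struct model_task, model_scheduler_tick() and model_return_to_user() are hypothetical names, and the real scheduler picks the next task itself instead of receiving it as a parameter.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, minimal stand-in for a task descriptor. */
struct model_task {
    int  pid;
    bool need_resched;
};

/* Timer tick: only marks the task, it does not switch anything. */
static void model_scheduler_tick(struct model_task *current,
                                 bool timeslice_expired)
{
    assert(current != NULL);
    if (timeslice_expired)
        current->need_resched = true;
}

/* Exit path of a syscall or interrupt: returns the task that will run
 * in user mode next. The PID changes only if the flag was set. */
static struct model_task *model_return_to_user(struct model_task *current,
                                               struct model_task *next)
{
    assert(current != NULL && next != NULL);
    if (current->need_resched) {
        current->need_resched = false;
        return next;            /* this is the actual context switch */
    }
    return current;             /* plain mode transition back to Ring 3 */
}
```

In this model a syscall that returns with the flag clear hands control back to the same task, so the PID is unchanged. A switch happens only when something set the flag beforehand.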