Thursday, December 20, 2012

Syscall - part 2


Some argue that on x86 a syscall implies a context switch. In fact, kernel code runs in one of two possible contexts:

  • in kernel space, in process context (on behalf of a specific process)
  • in kernel space, in interrupt context (not bound to any process)

The CPU executes either userland code in user space, or kernel code in one of the two above-mentioned contexts.

By context switch we usually mean changing the current process: it is what happens when the current PID changes (when the scheduler preempts a process). In the following lines I'd like to stress that a syscall happens without this context switch.
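The claim can be checked from user space. The sketch below assumes Linux and the glibc syscall() wrapper, and the helper names are made up for illustration: it issues many real system calls and asks the kernel (through getrusage()) how many context switches the process went through in the meantime. If every syscall implied a context switch, the two numbers would be of the same order.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sys/resource.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Context switches (voluntary + involuntary) charged to this process so far. */
static long context_switches_so_far(void)
{
    struct rusage ru;
    int rc = getrusage(RUSAGE_SELF, &ru);
    assert(rc == 0);
    return ru.ru_nvcsw + ru.ru_nivcsw;
}

/*
 * Hypothetical helper: issue n system calls and return how many context
 * switches happened while doing so. The PID is sampled before and after
 * to show that the same process is running throughout.
 */
long switches_during_syscalls(long n)
{
    pid_t pid_before = getpid();
    long before = context_switches_so_far();

    for (long i = 0; i < n; i++)
        syscall(SYS_getppid);   /* raw syscall, so libc cannot short-circuit it */

    long after = context_switches_so_far();
    assert(getpid() == pid_before);
    return after - before;
}
```

Calling switches_during_syscalls(100000) on an otherwise idle machine typically reports a number of switches that is orders of magnitude smaller than the number of syscalls; the few that do show up come from the scheduler preempting the process, not from the syscalls themselves.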

__kernel_vsyscall() looks like this:

ffffe400 <__kernel_vsyscall>:
ffffe400:       51                      push   %ecx
ffffe401:       52                      push   %edx
ffffe402:       55                      push   %ebp
ffffe403:       89 e5                   mov    %esp,%ebp
ffffe405:       0f 34                   sysenter 
ffffe407:       90                      nop    
ffffe408:       90                      nop    
ffffe409:       90                      nop    
ffffe40a:       90                      nop    
ffffe40b:       90                      nop    
ffffe40c:       90                      nop    
ffffe40d:       90                      nop    
ffffe40e:       eb f3                   jmp    ffffe403 <__kernel_vsyscall+0x3>
ffffe410:       5d                      pop    %ebp
ffffe411:       5a                      pop    %edx
ffffe412:       59                      pop    %ecx
ffffe413:       c3                      ret

esp = stack pointer
eip = instruction pointer

Explanation:
- after jumping to this address, registers %ecx, %edx and %ebp are saved on the user stack and %esp is copied to %ebp before executing sysenter (this %ebp later helps the kernel restore the userland stack)

- jmp __kernel_vsyscall+0x3 is just a trick that lets sysenter work with 6 arguments instead of 3 (the standard max number of args for a syscall is 6) while still supporting syscall restart:
- when a syscall has to be restarted, the kernel returns to this jmp, which goes back to mov %esp,%ebp and executes sysenter a second time (the second sysenter has no other impact: the syscall is just "restarted") - see https://lkml.org/lkml/2002/12/18/218 (Linus is a "disgusting pig") :)

- sysenter is executed; this brings the CPU into Ring 0 (a.k.a. CPL=0)

- sysenter (fast system call facility on x86) does the following:
- CS register set to the value of (SYSENTER_CS_MSR)
- EIP register set to the value of (SYSENTER_EIP_MSR)
- SS register set to the sum of (8 plus the value in SYSENTER_CS_MSR)
- ESP register set to the value of (SYSENTER_ESP_MSR)
- Intel defines these MSRs (model specific registers) at the following addresses:
SYSENTER_CS_MSR=0x174
SYSENTER_ESP_MSR=0x175
SYSENTER_EIP_MSR=0x176
- these values are defined in linux in /usr/src/linux/include/asm/msr.h:
#define MSR_IA32_SYSENTER_CS            0x174
#define MSR_IA32_SYSENTER_ESP           0x175
#define MSR_IA32_SYSENTER_EIP           0x176

- at bootup, Linux writes these MSRs (/usr/src/linux/arch/i386/kernel/sysenter.c):
wrmsr(MSR_IA32_SYSENTER_CS, __KERNEL_CS, 0);
wrmsr(MSR_IA32_SYSENTER_ESP, tss->esp1, 0);
wrmsr(MSR_IA32_SYSENTER_EIP, (unsigned long) sysenter_entry, 0);

- wrmsr writes the contents of registers EDX:EAX into the 64-bit model specific register (MSR) specified in the ECX register

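- 'tss' refers to the Task State Segment (TSS); tss->esp1 is the value sysenter loads into ESP, and the first instruction of sysenter_entry uses it to fetch the kernel mode stack pointer from the TSS (movl TSS_sysenter_esp0(%esp),%esp)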

- Although Linux doesn't use hardware context switches, it is nonetheless forced to set up a TSS for each distinct CPU in the system. This is done for two main reasons:
- When an x86 CPU switches from User Mode to Kernel Mode, it fetches the address of the Kernel Mode stack from the TSS
- When a User Mode process attempts to access an I/O port by means of an in or out instruction, the CPU may need to access an I/O Permission Bitmap stored in the TSS to verify whether the process is allowed to address the port.
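For reference, here is a sketch of the 32-bit hardware TSS layout as a C struct, showing where the kernel stack pointer (esp0) and the I/O Permission Bitmap offset live. The struct and function names are made up for this example and are not the kernel's own definitions; the bitmap check is simplified to a single-byte port access.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* 32-bit x86 Task State Segment as the hardware sees it (104 bytes). */
struct tss32 {
    uint16_t prev_task_link, reserved0;
    uint32_t esp0;              /* Ring 0 stack pointer, read on user->kernel entry */
    uint16_t ss0, reserved1;    /* Ring 0 stack segment */
    uint32_t esp1;              /* Ring 1 slot; the wrmsr snippet above reuses it */
    uint16_t ss1, reserved2;
    uint32_t esp2;
    uint16_t ss2, reserved3;
    uint32_t cr3, eip, eflags;
    uint32_t eax, ecx, edx, ebx, esp, ebp, esi, edi;
    uint16_t es, reserved4, cs, reserved5, ss, reserved6;
    uint16_t ds, reserved7, fs, reserved8, gs, reserved9;
    uint16_t ldt_selector, reserved10;
    uint16_t debug_trap;        /* bit 0: debug exception on task switch */
    uint16_t io_map_base;       /* offset of the I/O Permission Bitmap from the TSS base */
};

static_assert(sizeof(struct tss32) == 104, "hardware TSS is 104 bytes");
static_assert(offsetof(struct tss32, esp0) == 4, "esp0 lives at offset 4");
static_assert(offsetof(struct tss32, io_map_base) == 102, "I/O map base at offset 102");

/* One bit per port: a clear bit means User Mode may access the port. */
int io_port_allowed(const uint8_t *bitmap, uint16_t port)
{
    return (bitmap[port / 8] & (1u << (port % 8))) == 0;
}
```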

- So during initialization the kernel sets up these MSRs such that, after the sysenter instruction, ESP is set to the kernel mode stack and EIP is set to sysenter_entry
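The register loading described above can be modelled in a few lines of C. This is only a toy model under stated assumptions: the struct, the array standing in for the MSRs and the function names are all invented for illustration, and a real CPU does more (flat segment descriptors, flag changes) than what is shown here.

```c
#include <assert.h>
#include <stdint.h>

enum {
    MSR_IA32_SYSENTER_CS  = 0x174,
    MSR_IA32_SYSENTER_ESP = 0x175,
    MSR_IA32_SYSENTER_EIP = 0x176
};

static uint64_t msr_file[3];    /* stand-in for the three SYSENTER MSRs */

struct cpu_model { uint16_t cs, ss; uint32_t eip, esp; int cpl; };

/* Model of wrmsr: store EDX:EAX (high:low) into the MSR selected by ECX. */
void wrmsr_model(uint32_t ecx, uint32_t eax, uint32_t edx)
{
    assert(ecx >= MSR_IA32_SYSENTER_CS && ecx <= MSR_IA32_SYSENTER_EIP);
    msr_file[ecx - MSR_IA32_SYSENTER_CS] = ((uint64_t)edx << 32) | eax;
}

static uint64_t rdmsr_model(uint32_t ecx)
{
    assert(ecx >= MSR_IA32_SYSENTER_CS && ecx <= MSR_IA32_SYSENTER_EIP);
    return msr_file[ecx - MSR_IA32_SYSENTER_CS];
}

/*
 * Model of sysenter: everything comes from the MSRs. Note that nothing is
 * saved: no return address, no user stack pointer. That is why
 * __kernel_vsyscall copies %esp to %ebp first.
 */
void sysenter_model(struct cpu_model *cpu)
{
    uint16_t cs = (uint16_t)(rdmsr_model(MSR_IA32_SYSENTER_CS) & 0xfffc); /* RPL bits cleared */
    cpu->cs  = cs;
    cpu->ss  = (uint16_t)(cs + 8);
    cpu->esp = (uint32_t)rdmsr_model(MSR_IA32_SYSENTER_ESP);
    cpu->eip = (uint32_t)rdmsr_model(MSR_IA32_SYSENTER_EIP);
    cpu->cpl = 0;               /* Ring 0 */
}
```

The model makes one design point visible: since sysenter loads fixed values and saves nothing, the kernel must agree with user space on a fixed return address (SYSENTER_RETURN) and on where the user stack pointer is kept (%ebp).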

- now the kernel executes the following code. We are in Ring 0 and we are executing kernel code, but the current PID is still the old PID:
- the calling user thread is still at the sysenter line
- a context switch, if one is needed, is done before returning to user space
- in Linux, context switching is done in software, not in hardware
- Linux still needs the TSS for this: as noted above it keeps one TSS per CPU (not one per process) and updates the kernel stack pointer stored in it on every process switch

- When a transition between user mode and kernel mode is required in an operating system, a context switch is not necessary; a mode transition is not by itself a context switch. 

However, depending on the operating system, a context switch may also take place at this time. (http://en.wikipedia.org/wiki/Context_switch#User_and_kernel_mode_switching)

ENTRY(sysenter_entry)
        movl TSS_sysenter_esp0(%esp),%esp
sysenter_past_esp:
        sti
        pushl $(__USER_DS)
        pushl %ebp                      /* %ebp contains userland %esp */
        pushfl
        pushl $(__USER_CS)
        pushl $SYSENTER_RETURN          /* userland return address */

...
        pushl %eax
        SAVE_ALL                        /* pushes registers on to the stack */
        GET_THREAD_INFO(%ebp)

        /* Note, _TIF_SECCOMP is bit number 8, and so it needs testw and not testb */
        testw $(_TIF_SYSCALL_EMU|_TIF_SYSCALL_TRACE|_TIF_SECCOMP|_TIF_SYSCALL_AUDIT), TI_flags(%ebp)
        jnz syscall_trace_entry
        cmpl $(nr_syscalls), %eax
        jae syscall_badsys
        call *sys_call_table(,%eax,4)
        movl %eax,EAX(%esp)

#define SAVE_ALL \
cld; \
pushl %es; \
pushl %ds; \
pushl %eax; \
pushl %ebp; \
pushl %edi; \
pushl %esi; \
pushl %edx; \
pushl %ecx; \
pushl %ebx; \
movl $(__USER_DS), %edx; \
movl %edx, %ds; \
movl %edx, %es;
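The last three instructions of the entry code (cmpl, jae, call through sys_call_table) are a bounds-checked jump through a table of function pointers. A minimal C sketch of the same idea follows; the table contents and every toy_ name are invented for the example, only the shape of the dispatch mirrors the assembly.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

typedef long (*syscall_fn)(long, long, long);

/* Two made-up "system calls" standing in for the real table entries. */
static long toy_add(long a, long b, long c) { (void)c; return a + b; }
static long toy_max(long a, long b, long c)
{
    long m = a > b ? a : b;
    return m > c ? m : c;
}

static const syscall_fn toy_call_table[] = { toy_add, toy_max };
#define TOY_NR_SYSCALLS (sizeof toy_call_table / sizeof toy_call_table[0])

/*
 * nr plays the role of %eax. The comparison is unsigned, like jae, so a
 * "negative" syscall number is rejected by the same check.
 */
long toy_dispatch(unsigned long nr, long a, long b, long c)
{
    if (nr >= TOY_NR_SYSCALLS)          /* cmpl $(nr_syscalls),%eax ; jae syscall_badsys */
        return -ENOSYS;
    assert(toy_call_table[nr] != NULL);
    return toy_call_table[nr](a, b, c); /* call *sys_call_table(,%eax,4) */
}
```

For example, toy_dispatch(0, 2, 3, 0) calls toy_add and returns 5, while toy_dispatch(2, 0, 0, 0) falls outside the table and returns -ENOSYS, which is what syscall_badsys hands back to user space.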

In conclusion, what happens is not a context switch, but a mode transition. The current PID is the same. And, btw, user preemption can only happen in one of these 2 situations:

  • When returning to user space from a syscall, if the need_resched flag is set
  • When returning to user space from an interrupt handler, if the need_resched flag is set

The need_resched flag is set by the scheduler tick when a thread needs to be preempted, or by try_to_wake_up() when a higher priority process is woken up.
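The two rules above can be condensed into a toy model: the flag is only raised by the tick (or a wakeup), and it is only acted upon on the way back to user space. Everything named toy_ below is hypothetical and has nothing to do with the real kernel data structures; the model has exactly two tasks to keep it short.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct toy_task { int pid; bool need_resched; };

struct toy_cpu {
    struct toy_task *current;   /* task whose PID is "the current PID" */
    struct toy_task *waiting;   /* the one other runnable task */
    int context_switches;
};

/* Scheduler tick: the running task has used up its timeslice. */
void toy_scheduler_tick(struct toy_cpu *cpu)
{
    cpu->current->need_resched = true;
}

/* Common exit path of a syscall or interrupt, just before user mode resumes. */
void toy_return_to_user(struct toy_cpu *cpu)
{
    assert(cpu->current != NULL && cpu->waiting != NULL);
    if (!cpu->current->need_resched)
        return;                         /* plain mode transition, same PID */

    cpu->current->need_resched = false;
    struct toy_task *prev = cpu->current;
    cpu->current = cpu->waiting;        /* this is the actual context switch */
    cpu->waiting = prev;
    cpu->context_switches++;
}
```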

Floating point operations: hardware vs software


This is a (maybe incomplete) list of the runtime helper routines GCC uses for floating point operations (a few entries, namely __udivsi3, __udivdi3, __umoddi3 and _vfprintf_r, are integer division/modulo and printf helpers rather than floating point routines):
__udivsi3
_vfprintf_r
__eqdf2
__nedf2
__umoddi3
__udivdi3
__ltdf2
__negdf2
__subdf3
__muldf3
__adddf3
__floatsidf
__fixdfsi
__gtdf2
__divdf3
__gedf2

Depending on the gcc/g++ compilation flags (e.g. -msoft-float), the compiler either emits hardware FPU instructions directly (x87 on x86 machines) or emits calls to the above-mentioned functions, which the linker resolves against libgcc or a user-supplied library that implements them.


Ideally, the only difference between hardware and software FP operations is speed, because the software routines implement the same arithmetic as the hardware.


An example of fp div operation (from Apple oss clang implementation): http://www.opensource.apple.com/source/clang/clang-163.7.1/src/projects/compiler-rt/lib/divdf3.c
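Division is long; two of the simpler routines from the list give a feeling for what "floating point in software" means. The sketch below is not the libgcc or compiler-rt code: soft_neg and soft_int_to_double are made-up stand-ins for __negdf2 and __floatsidf, written under the assumption that double is an IEEE 754 binary64 value. They build the result with integer operations only.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

static_assert(sizeof(double) == 8, "sketch assumes a 64-bit IEEE 754 double");

static uint64_t double_to_bits(double d)
{
    uint64_t bits;
    memcpy(&bits, &d, sizeof bits);
    return bits;
}

static double bits_to_double(uint64_t bits)
{
    double d;
    memcpy(&d, &bits, sizeof d);
    return d;
}

/* Stand-in for __negdf2: negation is just flipping the sign bit. */
double soft_neg(double x)
{
    return bits_to_double(double_to_bits(x) ^ (UINT64_C(1) << 63));
}

/* Stand-in for __floatsidf: int32 -> double. Always exact, no rounding needed. */
double soft_int_to_double(int32_t i)
{
    if (i == 0)
        return bits_to_double(0);

    uint64_t sign = (i < 0) ? (UINT64_C(1) << 63) : 0;
    /* Magnitude as unsigned; this form also works for INT32_MIN. */
    uint32_t mag = (i < 0) ? (uint32_t)0 - (uint32_t)i : (uint32_t)i;

    int msb = 31;                       /* position of the highest set bit */
    while (!(mag & (UINT32_C(1) << msb)))
        msb--;

    uint64_t exponent = (uint64_t)(msb + 1023);     /* biased exponent */
    /* Move the leading 1 to bit 52, then drop it (it is implicit). */
    uint64_t mantissa = ((uint64_t)mag << (52 - msb)) & ((UINT64_C(1) << 52) - 1);
    return bits_to_double(sign | (exponent << 52) | mantissa);
}
```

On a machine with an FPU the results can be compared directly against the hardware, e.g. soft_int_to_double(7) against (double)7.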


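In reality, some FPUs have features such as keeping intermediate results in higher precision and not rounding until the operand is written to memory (the x87 does this with its 80-bit registers); this can give slightly different, sometimes better rounded, results than software implementations.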
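The effect of delaying the rounding can be reproduced in C by choosing where the intermediate value is stored. This is a sketch with made-up function names; it assumes an IEEE 754 double, and the second function only behaves differently on platforms where long double is wider than double (for example the 80-bit x87 format that gcc uses on x86 Linux).

```c
#include <assert.h>
#include <float.h>

static_assert(DBL_MANT_DIG == 53, "sketch assumes an IEEE 754 double");

/* The intermediate sum is stored to a double, so it is rounded to 53 bits. */
double rounded_each_step(void)
{
    volatile double big = 1e16, one = 1.0;
    volatile double sum = big + one;    /* 1e16 + 1 is not representable, rounds back to 1e16 */
    return sum - big;
}

/* The intermediate sum stays in long double and is rounded only at the end. */
double rounded_at_the_end(void)
{
    volatile long double big = 1e16L, one = 1.0L;
    long double sum = big + one;        /* exact when long double has at least 54 mantissa bits */
    return (double)(sum - big);
}
```

rounded_each_step() returns 0.0: the added 1.0 is lost when the sum is rounded to double. Where LDBL_MANT_DIG is 64, as with the x87 format, rounded_at_the_end() returns 1.0 because the sum survives in extended precision until the final conversion.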

However, for precise FP operations (e.g. the ones used in financial applications), libraries exist which reduce the rounding errors even further: e.g. Java's BigDecimal, GMP (http://gmplib.org/), etc.
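Why financial code avoids plain binary floating point can be shown without any library. The sketch below does not use GMP or BigDecimal; it swaps in the simplest possible exact representation, integer cents, and the function names are made up. It assumes an IEEE 754 double.

```c
#include <assert.h>
#include <stdint.h>

/* Add 0.1 n times in binary floating point; 0.1 has no exact binary form. */
double add_tenths_binary(int n)
{
    assert(n >= 0);
    volatile double total = 0.0;    /* volatile: round to double after every step */
    for (int i = 0; i < n; i++)
        total += 0.1;
    return total;
}

/* Add the same amount n times as integer cents; every step is exact. */
int64_t add_cents_exact(int64_t cents_each, int n)
{
    assert(n >= 0);
    int64_t total = 0;
    for (int i = 0; i < n; i++)
        total += cents_each;
    return total;
}
```

add_tenths_binary(10) does not compare equal to 1.0, while add_cents_exact(10, 10) is exactly 100. Arbitrary-precision and decimal libraries generalize the second approach to amounts and scales that do not fit a fixed integer.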