Tuesday, April 12, 2011

File.list() issue in Java

Maybe some of you have encountered problems with file names that the JVM handles incorrectly. To be more precise, File.list() and any other method that uses a file name can misbehave (for example, opening the file throws a FileNotFoundException) if the file name contains characters encoded in a charset other than the one the JVM assumes.


What happens?
Each file system has a so-called "io charset" setting used to transform the characters of a file name into bytes. Many Linux systems use ISO-8859-1, ISO-8859-15 or UTF-8 here. At startup the JVM tries to figure out which encoding the operating system uses. If some environment variables are not set (like LC_ALL, LANG, etc.), it will choose ISO-8859-1 by default. Sometimes, however, it guesses the wrong charset, and then it transforms the bytes of the file name (which it receives from the file system) into the wrong characters! :) Of course this happens *only* for some "special" characters: those that are encoded differently in the two charsets. For 'a'-'z', 'A'-'Z' and 0-9 (all the ASCII characters) the encoding is the same in both charsets, as it is in any ASCII-compatible charset. The same cannot be said about German umlauts, for instance.


How can this be fixed?
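Make sure the JVM is started with a locale whose charset matches the one used for the file names: set LC_ALL or LANG accordingly (for example to a UTF-8 locale if the names are stored as UTF-8) before launching java. On Sun/Oracle JVMs the charset that was picked for file names can be checked through the sun.jnu.encoding system property.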


The file.encoding system property has no effect on this issue (contrary to what you'll see in some posts). It only affects the default charset used for file contents, for example by stream readers.

System calls (part 1)

Definition: a controlled entry point into the kernel. A way by which user programs can execute functions that require a greater privilege.


A few words about privileges... In general, each CPU / microcontroller has a set of operating modes. Some of these modes concern security; without going too deep into details, security means:
- what memory areas can be read/written (depending on the memory map, some ranges from the addressing space might point to I/O devices - this happens for memory-mapped I/O)
- what instructions can be executed (I'm referring to machine instructions; some of them are privileged)
The current IA-32 architecture has 4 so-called privilege or protection rings. Ring0 is the level with the most privileges, ring1 is next, and ring3 has the fewest.
[Terminology: the ring the CPU is currently running in is also called the current privilege level (CPL) => sometimes we see the terms ring0 to ring3, sometimes CPL-0 to CPL-3.]
Software runs in one of these 4 rings. 
Operating systems manage the switching from one ring to another by executing the CPU instructions that effectively do this. The kernel runs in ring0. Rings 1 and 2 were intended for code such as device drivers, but most operating systems (Linux included) run drivers in ring0 and leave rings 1 and 2 unused. User applications run in ring3, which restricts access to certain functions (like the memory map, for instance) that would impact the correct behavior of other applications. Since only kernel code runs in ring0, the kernel controls which code runs in which ring. An application that runs in a low-privileged ring cannot force the CPU to switch to a higher-privilege ring, because it simply doesn't have the right to execute the instructions that change the CPU state.


Now let's see what happens when a user application makes a system call.
Each OS has an API that can be accessed by user applications. Among the provided functions are: I/O functions, process management, IPC, etc. I will describe what happens when making a Linux system call:
1. The application program makes a system call by calling a wrapper function in the C library (glibc, etc.), for instance open (which fopen in turn uses).
2. The wrapper function must make all of the system call parameters available to the kernel. It receives the parameters on the stack (the user process' stack), but must copy them into registers in order to pass them to the kernel: the system call number goes in %eax and the arguments go in %ebx, %ecx, %edx, %esi, %edi and %ebp.
3. The wrapper calls the __kernel_vsyscall function, which is the function that executes sysenter. The address of __kernel_vsyscall is not fixed; the kernel passes it to user processes through the AT_SYSINFO entry of the ELF auxiliary vector. Before executing sysenter, __kernel_vsyscall saves %ecx (counter register), %edx (data register) and %ebp (base pointer) on the user stack and copies %esp (stack pointer) to %ebp, which helps in restoring the user stack afterwards.
4. After the sysenter instruction is executed, the processor starts execution at sysenter_entry, which is defined in /usr/src/linux/arch/i386/kernel/entry.S.


Of course, user programs can make a system call directly, but wrapper functions provide a more user-friendly way of making these calls.