Tuesday, April 12, 2011

File.list() issue in Java

Some of you may have run into file names that the JVM handles incorrectly. More precisely, File.list() can return garbled names, and any method that then opens such a file can throw a FileNotFoundException, when the file name contains characters encoded in a charset other than the one the JVM assumes.
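One way to check whether a directory is affected is to list it and then ask whether each returned name can actually be reached. The following is only a rough sketch: the class and method names are made up for illustration, and a dangling symbolic link will also show up as "unreachable", so a hit is a hint rather than proof.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of any library.
class UnreachableNames {

    // Returns the names that File.list() reports but that cannot be
    // resolved back to an existing file.
    static List<String> find(File dir) {
        List<String> unreachable = new ArrayList<>();
        String[] names = dir.list();
        if (names == null) {
            // Not a directory, or an I/O error occurred.
            return unreachable;
        }
        for (String name : names) {
            // If the name was decoded with the wrong charset, encoding it
            // again may yield different bytes, so the lookup fails.
            if (!new File(dir, name).exists()) {
                unreachable.add(name);
            }
        }
        return unreachable;
    }

    public static void main(String[] args) {
        File dir = new File(args.length > 0 ? args[0] : ".");
        for (String name : find(dir)) {
            System.out.println("listed but not found: " + name);
        }
    }
}
```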


What happens?
Each file system has a so-called "io charset" setting used to transform the characters of a file name into bytes. Most Linux file systems use ISO-8859-1 or ISO-8859-15 by default. At startup, the JVM tries to work out which encoding the operating system uses. If certain environment variables (LC_ALL, LANG, etc.) are not set, it falls back to ISO-8859-1. Sometimes, however, it guesses the wrong charset, and then it decodes the bytes of the file name (which it receives from the file system) into the wrong characters! :) Of course this happens *only* for "special" characters: those that are encoded differently in the two charsets. For 'a'-'z', 'A'-'Z' and 0-9 (all the ASCII characters) the encoding is the same in every ASCII-compatible charset. The same cannot be said of German umlauts, for instance.
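The effect can be reproduced without touching the file system at all. The sketch below assumes, purely for illustration, that the name is stored as UTF-8 and that the JVM decodes it as ISO-8859-1; the class and method names are invented for this example.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of any library.
class FileNameDecodingDemo {

    // Encode the name as the file system would store it, then decode the
    // raw bytes with the charset the JVM (wrongly) assumes.
    static String roundTrip(String name, Charset onDisk, Charset assumedByJvm) {
        byte[] raw = name.getBytes(onDisk);
        return new String(raw, assumedByJvm);
    }

    public static void main(String[] args) {
        Charset onDisk = StandardCharsets.UTF_8;       // assumption
        Charset assumed = StandardCharsets.ISO_8859_1; // assumption

        String ascii = "report.txt";
        String umlaut = "\u00FCbung.txt"; // u-umlaut followed by "bung.txt"

        // ASCII characters have the same bytes in both charsets.
        System.out.println(ascii.equals(roundTrip(ascii, onDisk, assumed)));   // prints "true"

        // The umlaut is two bytes in UTF-8 (0xC3 0xBC); ISO-8859-1 decodes
        // them as two separate characters, so the name no longer matches.
        System.out.println(umlaut.equals(roundTrip(umlaut, onDisk, assumed))); // prints "false"
    }
}
```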


How can this be fixed?
Set the os.encoding system property when starting the JVM (-Dos.encoding=<charset>), using the charset of the file system.


The file.encoding system property has no effect on this issue (contrary to what you'll see in some posts). It only affects the default charset used by stream readers.
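When chasing this kind of problem, it helps to print what the running JVM actually picked up. The sketch below is a small diagnostic with an invented class name. Note that sun.jnu.encoding is an implementation-specific property of Oracle/OpenJDK builds and that os.encoding may not be defined on every JVM, so either of them may print "null".

```java
import java.nio.charset.Charset;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical diagnostic class, not part of any library.
class EncodingReport {

    // Collects the encoding-related settings visible to this JVM.
    // Missing properties or variables are reported as the string "null".
    static Map<String, String> collect() {
        Map<String, String> report = new LinkedHashMap<>();
        String[] properties = {"file.encoding", "sun.jnu.encoding", "os.encoding"};
        for (String key : properties) {
            report.put(key, String.valueOf(System.getProperty(key)));
        }
        report.put("Charset.defaultCharset()", Charset.defaultCharset().name());
        // Environment variables the JVM consults on Unix-like systems.
        report.put("LANG", String.valueOf(System.getenv("LANG")));
        report.put("LC_ALL", String.valueOf(System.getenv("LC_ALL")));
        return report;
    }

    public static void main(String[] args) {
        collect().forEach((key, value) -> System.out.println(key + " = " + value));
    }
}
```

Running it once normally and once with the -D option set on the command line shows which values the option really changes on a given JVM.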
