What happens?
Each file system has a so-called "io charset" setting, which is used to transform the characters of a file name into bytes. Many Linux systems have historically used ISO-8859-1 or ISO-8859-15 by default. At startup, the JVM tries to figure out which encoding the operating system uses. If certain environment variables are not set (LC_ALL, LANG, etc.), it will choose ISO-8859-1 by default. Sometimes, however, it guesses the wrong charset, and then it transforms the bytes of the file name (which it receives from the file system) into the wrong chars! :) Of course this happens *only* for some "special" chars: those that are encoded as different bytes in the two charsets. For 'a'-'z', 'A'-'Z' and 0-9 (all the ASCII chars), the code points are the same in every ASCII-compatible charset, including the ones mentioned here. The same cannot be said of German umlauts, for instance.
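To make the effect concrete, here is a minimal sketch of what a wrong guess does to a name. It assumes, purely for illustration, that the file system stores names as UTF-8 while the JVM decodes them as ISO-8859-1; the class and method names are made up for this example.

```java
import java.nio.charset.StandardCharsets;

public class MisdecodedName {

    // Simulates the mismatch: the name is turned into bytes with one charset
    // (UTF-8, standing in for the file system) and turned back into chars
    // with another (ISO-8859-1, standing in for the JVM's wrong guess).
    static String misdecode(String name) {
        byte[] raw = name.getBytes(StandardCharsets.UTF_8);
        return new String(raw, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // ASCII-only names survive, because both charsets agree on ASCII.
        System.out.println(misdecode("readme.txt"));

        // \u00fc is the umlaut 'ü'. UTF-8 stores it as two bytes (C3 BC),
        // which ISO-8859-1 decodes as two separate chars.
        System.out.println(misdecode("B\u00fccher.txt"));
    }
}
```

The first call returns the name unchanged. The second returns an 11-character string in which the single umlaut has become the two chars U+00C3 and U+00BC, which is the kind of garbage you end up seeing in file listings.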
How can this be fixed?
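Don't forget to set -Dos.encoding when starting the JVM. Since the JVM derives its guess from the locale variables mentioned above, also make sure that LC_ALL or LANG is set, before the JVM starts, to a locale whose charset matches the one used for the file names.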
The file.encoding system property doesn't have any effect on this issue (contrary to what you'll see in some posts). file.encoding only affects the default charset of stream readers and writers, that is, how file contents are decoded, not how file names are.
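When debugging this, it helps to see what the JVM actually picked up at startup. The following is a small diagnostic sketch, not a fix. The class name is made up, and sun.jnu.encoding is an implementation-specific property that OpenJDK-based JVMs expose for the charset used for file names; other JVMs may not define it, in which case "null" is printed.

```java
import java.nio.charset.Charset;
import java.util.LinkedHashMap;
import java.util.Map;

public class EncodingReport {

    // Collects the locale variables and the encoding-related settings
    // the running JVM reports. Missing values show up as "null".
    static Map<String, String> collect() {
        Map<String, String> report = new LinkedHashMap<>();
        report.put("LC_ALL", String.valueOf(System.getenv("LC_ALL")));
        report.put("LANG", String.valueOf(System.getenv("LANG")));
        report.put("file.encoding", String.valueOf(System.getProperty("file.encoding")));
        // Implementation-specific (assumption: an OpenJDK-based JVM).
        report.put("sun.jnu.encoding", String.valueOf(System.getProperty("sun.jnu.encoding")));
        report.put("defaultCharset", Charset.defaultCharset().name());
        return report;
    }

    public static void main(String[] args) {
        collect().forEach((key, value) -> System.out.println(key + " = " + value));
    }
}
```

Run it once from the shell that shows the broken names and once with LC_ALL or LANG set explicitly, then compare the two reports.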