Ticket #3307 (new bug)

Opened 8 months ago

Last modified 7 months ago

System.IO and System.Directory functions not Unicode-aware under Unix

Reported by: YitzGale Owned by:
Component: libraries/base Version: 6.11
Keywords: directory unicode Cc:
Operating System: Unknown/Multiple
Test Case: Architecture: Unknown/Multiple
Type of failure:

Description

Under Unix, file paths are represented as raw bytes in a String. That is not user-friendly, because a String is supposed to be decoded Unicode, and it is conventional in Unix to view those raw bytes as encoded according to the current locale. In addition, this is not consistent with Windows, where file paths are natively Unicode and represented as such in the String. (Well, they will be consistently once #3300 is completed.)

On the other hand, this raises various complications: what about encoding errors, and what if encode . decode is not the identity due to normalisation?

The following cases ought to work consistently for all file operations in System.IO and System.Directory:

  • A FilePath from getArgs
  • A FilePath from getDirectoryContents
  • A FilePath in Unicode from a String literal
  • A FilePath read from a Handle and decoded into Unicode

See discussion in the thread  http://www.haskell.org/pipermail/haskell-cafe/2009-June/062795.html

Change History

  Changed 8 months ago by YitzGale

This change needs to be coordinated with #3309 ("getArgs should return Unicode on Unix") so that it will still work to read file paths from the command line and use them to access files.

in reply to: ↑ description ; follow-up: ↓ 3   Changed 8 months ago by duncan

Replying to YitzGale:

Under Unix, file paths are represented as raw bytes in a String. That is not user-friendly, because a String is supposed to be decoded Unicode, and it is conventional in Unix to view those raw bytes as encoded according to the current locale.

Unfortunately it is not conventional on Unix to interpret file names as Unicode, decoded from the current locale. When presenting file names to the user in a user interface some decoding is necessary, though there is not universal agreement that the locale is the right one. For example glib uses UTF-8 always, unless you set some special env var to tell it to use the current locale (the latter is considered a compatibility hack that will be phased out).

Certainly it's the case that FilePath as a Haskell String is not accurate for Unix paths (though it is for Windows and OSX). Something more accurate would be (an ADT containing) a pair of the original binary filename and a Unicode human readable String decoding of it. It needs both because the decoding may be lossy. On Windows and OSX the binary part would not be needed because they use Unicode natively.
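
As a rough illustration of the pair idea, here is a minimal sketch. The type UnixPath, its field names, and the toy decoder are all hypothetical and exist in no library; the decoder simply maps every non-ASCII byte to U+FFFD, standing in for whatever locale or UTF-8 decoding a real implementation would use.

```haskell
import Data.Char (chr)
import Data.Word (Word8)

-- Hypothetical type for illustration; not part of base or any library.
data UnixPath = UnixPath
  { rawName     :: [Word8]  -- exact bytes from the OS, used to open the file
  , displayName :: String   -- decoded form, for showing to the user only
  } deriving (Eq, Show)

-- Toy decoder (assumption): ASCII passes through, any other byte
-- becomes U+FFFD. A real one would decode per locale or as UTF-8.
decodeForDisplay :: [Word8] -> String
decodeForDisplay = map toChar
  where
    toChar b
      | b < 0x80  = chr (fromIntegral b)
      | otherwise = '\xFFFD'

-- Build the pair from the raw name, keeping the bytes untouched.
fromRaw :: [Word8] -> UnixPath
fromRaw bs = UnixPath bs (decodeForDisplay bs)

main :: IO ()
main = do
  let a = fromRaw [0x61, 0xE9]  -- 'a' followed by byte 0xE9
      b = fromRaw [0x61, 0xEA]  -- 'a' followed by byte 0xEA
  print (displayName a == displayName b)  -- do they look the same to the user?
  print (rawName a == rawName b)          -- are they the same file name?
```

Because the raw bytes are kept alongside the decoded text, two names that display identically can still be told apart when the file is opened.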

The problem with making getArgs and openFile return Unicode is it may be impossible to open certain files passed on the command line (those for which the decoding is lossy).

I would argue the solution is to move FilePath to being opaque, rather than towards it being properly interpreted as a Haskell Unicode String.

in reply to: ↑ 2 ; follow-up: ↓ 4   Changed 8 months ago by YitzGale

Replying to duncan:

Unfortunately it is not conventional on Unix to interpret file names as Unicode, decoded from the current locale.

AFAIK shells running in all modern vterms and xterms display them this way.

For example glib uses UTF-8 always, unless you set some special env var to tell it to use the current locale (the latter is considered a compatibility hack that will be phased out).

Oh really? Is that because we can soon assume that all locales are UTF-8? If so, it makes our work easier, as Ketil pointed out.

What does Qt do?

Something more accurate would be (an adt...

Yes, a richer type would be a tremendous help. But simonmar has pointed out that it would break H98 compatibility, so it doesn't seem to be an option.

The problem with making getArgs and openFile return Unicode is it may be impossible to open certain files passed on the command line (those for which the decoding is lossy).

On the other hand, they are decoded on other platforms. We don't want to make it impossible to write platform-independent code for any program that reads its args.

Would that actually happen for users using any normal UI and any normal input method? It has always been possible in Unix to create weird file names that are very difficult to deal with, but it won't happen in normal usage. We can provide a Unix-specific hack for the odd case.

in reply to: ↑ 3   Changed 8 months ago by duncan

Replying to YitzGale:

A good reference on what glib does and recommends is here:  http://library.gnome.org/devel/glib/stable/glib-Character-Set-Conversion.html See the description section, after the synopsis.

The problem with making getArgs and openFile return Unicode is it may be impossible to open certain files passed on the command line (those for which the decoding is lossy).

On the other hand, they are decoded on other platforms.

They use the native Unicode representation on other platforms. I don't see that that is an argument to use a non-native representation on Unix platforms.

We don't want to make it impossible to write platform-independent code for any program that reads its args.

Unfortunately as it stands it is impossible for platform-independent code to have both of these properties simultaneously:

  • Read all files passed on the command line
  • Display file names to humans accurately in a user interface.

Currently we get the first property and you're proposing to drop that and switch to the second.

It's pretty well ingrained that FilePath is the type for specifying files eg to open them (it's specified by H98). It's a much more recent problem that we want to display Unicode file names in user interfaces. For portable code, how about we add a function:

filePathToString :: FilePath -> String

On Unix this would decode. On Windows and OSX it'd be the identity since on those platforms the string would already have been decoded.

It means we treat FilePath as if it were an ADT (with differing representation on different platforms) but without actually switching to an opaque type.
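
To make the Unix side concrete, here is a minimal sketch of how such a filePathToString might decode. It rests on two assumptions that are not settled in this ticket: that each Char of a Unix FilePath holds one raw byte (0..255), and that the bytes are meant as UTF-8 rather than some other locale encoding. Malformed input is replaced by U+FFFD, which is exactly where the decoding becomes lossy.

```haskell
import Data.Bits (shiftL, (.&.), (.|.))
import Data.Char (chr, ord)

-- Sketch only. Assumes one raw byte per Char and UTF-8 as the encoding.
-- Any byte that does not start a well-formed sequence becomes U+FFFD.
filePathToString :: FilePath -> String
filePathToString = go . map ord
  where
    go [] = []
    go (b:bs)
      | b < 0x80               = chr b : go bs
      | b >= 0xC2 && b <= 0xDF = multi 1 (b .&. 0x1F) 0x80    bs
      | b >= 0xE0 && b <= 0xEF = multi 2 (b .&. 0x0F) 0x800   bs
      | b >= 0xF0 && b <= 0xF4 = multi 3 (b .&. 0x07) 0x10000 bs
      | otherwise              = '\xFFFD' : go bs

    -- Read n continuation bytes; reject truncated, overlong,
    -- surrogate, and out-of-range sequences.
    multi n acc minCp bs =
      let (conts, rest) = splitAt n bs
          cp = foldl (\a c -> (a `shiftL` 6) .|. (c .&. 0x3F)) acc conts
          ok = length conts == n
               && all isCont conts
               && cp >= minCp
               && cp <= 0x10FFFF
               && not (cp >= 0xD800 && cp <= 0xDFFF)
      in if ok then chr cp : go rest else '\xFFFD' : go bs

    isCont c = c >= 0x80 && c <= 0xBF

main :: IO ()
main = do
  print (filePathToString "caf\xC3\xA9.txt")  -- well-formed UTF-8 bytes
  print (filePathToString "caf\xE9.txt")      -- Latin-1 byte, not valid UTF-8
```

On Windows and OSX the same function would simply be id, as described above.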

Would that actually happen for users using any normal UI and any normal input method?

Generating new names is not a huge problem. The user selects a name in Unicode and if the conversion to a FilePath is impossible or lossy then the user can be prompted to select a different name. Note that this does need another function:

filePathFromString :: String -> Maybe FilePath
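
A minimal sketch of the Unix side of such a function, under the same assumptions as before (one raw byte per Char in the resulting FilePath, UTF-8 as the target encoding). With UTF-8 the only inputs rejected here are NUL, which cannot appear in a Unix file name, and surrogate code points, which have no UTF-8 encoding; with a narrower locale encoding many more strings would yield Nothing.

```haskell
import Data.Bits (shiftR, (.&.), (.|.))
import Data.Char (chr, ord)

-- Sketch only. Encodes a Unicode String as UTF-8, one byte per Char.
-- Returns Nothing when some character cannot be represented.
filePathFromString :: String -> Maybe FilePath
filePathFromString = fmap concat . mapM encodeChar
  where
    encodeChar c
      | cp == 0                      = Nothing  -- NUL is not allowed in a name
      | cp >= 0xD800 && cp <= 0xDFFF = Nothing  -- surrogates are not encodable
      | cp < 0x80                    = Just [c]
      | cp < 0x800   = Just (bytes [0xC0 .|. (cp `shiftR` 6), cont 0])
      | cp < 0x10000 = Just (bytes [0xE0 .|. (cp `shiftR` 12), cont 6, cont 0])
      | otherwise    = Just (bytes [0xF0 .|. (cp `shiftR` 18), cont 12, cont 6, cont 0])
      where
        cp     = ord c
        -- Continuation byte carrying the six bits starting at bit s.
        cont s = 0x80 .|. ((cp `shiftR` s) .&. 0x3F)
        bytes  = map chr

main :: IO ()
main = do
  print (filePathFromString "caf\xE9.txt")  -- representable
  print (filePathFromString "bad\0name")    -- contains NUL
```

A caller can prompt the user for a different name whenever the result is Nothing.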

It has always been possible in Unix to create weird file names that are very difficult to deal with, but it won't happen in normal usage. We can provide a Unix-specific hack for the odd case.

The most frustrating thing for a user would be selecting a file, having the app read it, but be unable to save back to the exact same file because of lossy decoding. That's why such apps are supposed to save the real file name, and translate that into a string, but they must keep the original name because the decoding can be lossy.

Unfortunately that's not a case we can just provide Unix-specific hacks for; it can happen for almost any portable app. E.g. consider apps that translate .foo files into .bar files (like, say, a compiler or preprocessor). If we decode filename.foo into Unicode but the conversion is lossy, then saving filename.bar may work, but the file names will no longer correspond, which could break things (think chars replaced by '?').
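
The .foo to .bar case can be sketched as follows. Everything here is hypothetical: lossyDecode is a toy stand-in that replaces each non-ASCII byte with '?', and fooToBar is a made-up helper that swaps the extension. The point is only that deriving the output name from the decoded string gives a different name than deriving it from the raw one.

```haskell
import Data.Char (ord)
import Data.List (stripPrefix)

-- Toy stand-in for a lossy decoder (assumption): non-ASCII bytes become '?'.
lossyDecode :: FilePath -> String
lossyDecode = map (\c -> if ord c < 0x80 then c else '?')

-- Remove a suffix if present.
stripSuffix :: String -> String -> Maybe String
stripSuffix suf s = fmap reverse (stripPrefix (reverse suf) (reverse s))

-- Hypothetical helper: swap ".foo" for ".bar"; other names are unchanged.
fooToBar :: String -> String
fooToBar name = case stripSuffix ".foo" name of
  Just stem -> stem ++ ".bar"
  Nothing   -> name

main :: IO ()
main = do
  let raw = "r\xE9sum\xE9.foo"        -- raw bytes as received on Unix
  print (fooToBar raw)                -- output name derived from the raw name
  print (fooToBar (lossyDecode raw))  -- output name derived from the decoded name
```

The two printed names differ, so the .bar file no longer sits next to the .foo file it was generated from.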

So my suggestion basically is, keep FilePath as a file path, and convert to/from String for human consumption.

  Changed 7 months ago by igloo

  • difficulty set to Unknown
  • milestone set to 6.14.1