Ticket #3309 (closed bug: fixed)

Opened 4 years ago

Last modified 2 years ago

getArgs should return Unicode on Unix

Reported by: YitzGale Owned by: batterseapower
Priority: high Milestone: 7.2.1
Component: libraries/base Version: 6.11
Keywords: unicode Cc: slyfox@…, marcot@…
Operating System: Unknown/Multiple Architecture: Unknown/Multiple
Type of failure: None/Unknown Difficulty: Unknown
Test Case: Blocked By:
Blocking: Related Tickets:

Description

The raw bytes of args should be decoded according to the current locale.

An additional function should be added:

getArgsBytes :: IO [Word8]

to provide access to the raw bytes.
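As a rough illustration of what "decoded according to the current locale" could mean for a single argument, the sketch below decodes a list of raw bytes with an explicitly chosen encoding. The name `decodeArg` is hypothetical and not part of base; it assumes a base version that ships `GHC.Foreign` and `GHC.IO.Encoding` (4.4 or later). Passing the encoding explicitly keeps the example independent of the machine's locale.

```haskell
import Data.Word (Word8)
import Foreign.Marshal.Array (withArrayLen)
import Foreign.Ptr (castPtr)
import qualified GHC.Foreign as GHC
import GHC.IO.Encoding (TextEncoding, latin1, utf8)

-- Hypothetical helper: decode the raw bytes of one argument using the
-- given text encoding (a real getArgs would use the locale's encoding).
decodeArg :: TextEncoding -> [Word8] -> IO String
decodeArg enc bytes =
  -- Copy the bytes into a temporary buffer and decode them from there.
  withArrayLen bytes $ \len ptr ->
    GHC.peekCStringLen enc (castPtr ptr, len)
```

With this sketch, the two bytes 0xC3 0xA9 decode to the single character U+00E9 under `utf8`, but to two separate characters under `latin1`, which is why access to the undecoded bytes is still useful.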

This change needs to be coordinated with #3007 so that it will still work to read a file name from the command line args and use it to access a file.

This change should also be made on Windows: #3008

See the discussion at  http://www.haskell.org/pipermail/haskell-cafe/2009-June/062795.html

Change History

  Changed 4 years ago by igloo

  • difficulty set to Unknown
  • milestone set to 6.14.1

in reply to: ↑ description   Changed 3 years ago by slyfox

  • failure set to None/Unknown

Replying to YitzGale:

The raw bytes of args should be decoded according to the current locale. An additional function should be added: getArgsBytes :: IO [Word8]

s/\[Word8\]/\[\[Word8\]\]/ :]

to provide access to the raw bytes. This change needs to be coordinated with #3007 so that it will still work to read a file name from the command line args and use it to access a file. This change should also be made on Windows: #3008 See the discussion at  http://www.haskell.org/pipermail/haskell-cafe/2009-June/062795.html

Or maybe make getArgs, readFile and friends polymorphic, the way printf from Text.Printf is?

printf :: PrintfType r => String -> r -- Defined in Text.Printf

instance (IsChar c) => PrintfType [c] -- Defined in Text.Printf
instance PrintfType (IO a) -- Defined in Text.Printf
instance (PrintfArg a, PrintfType r) => PrintfType (a -> r)

In our case it would be something like

getArgs :: StringAlike s => IO [s]

and usage would look like:
foo = getArgs :: IO [[Word8]]    -- raw bytes
foo = getArgs :: IO [ByteString] -- raw bytes in fast bytestring
foo = getArgs :: IO [String]     -- locale encoded
-- maybe others?
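A minimal sketch of how such return-type polymorphism might be wired up, using the same element-class trick that `IsChar` provides for printf. Everything here is hypothetical: `ArgElem`, `fromArgBytes` and `argsAs` are illustration names, `argsAs` takes the raw bytes as input instead of reading the real command line, and the `Char` instance uses Latin-1 as a stand-in for proper locale decoding.

```haskell
import Data.Word (Word8)

-- Hypothetical element class, playing the role IsChar plays for printf.
class ArgElem c where
  fromBytes :: [Word8] -> [c]

instance ArgElem Word8 where
  fromBytes = id                          -- raw bytes, untouched

instance ArgElem Char where
  fromBytes = map (toEnum . fromIntegral) -- Latin-1 stand-in for locale decoding

-- Hypothetical class named after the StringAlike constraint proposed above.
class StringAlike s where
  fromArgBytes :: [Word8] -> s

instance ArgElem c => StringAlike [c] where
  fromArgBytes = fromBytes

-- Pure stand-in for the proposed getArgs: the type requested by the
-- caller selects the representation of each argument.
argsAs :: StringAlike s => [[Word8]] -> [s]
argsAs = map fromArgBytes
```

A caller would then write `argsAs raw :: [String]` or `argsAs raw :: [[Word8]]`. One cost of this design is that every use site needs enough type information to pick an instance, which is the same ambiguity printf users occasionally run into.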

Thanks!

  Changed 3 years ago by slyfox

  • cc slyfox@… added

  Changed 2 years ago by igloo

  • milestone changed from 7.0.1 to 7.0.2

  Changed 2 years ago by marcotmarcot

  • cc marcot@… added

The same applies to System.Environment.getEnvironment.

  Changed 2 years ago by igloo

  • milestone changed from 7.0.2 to 7.2.1

  Changed 2 years ago by batterseapower

I have a patch to add locale-awareness to the CString functions in Foreign.C.String, which fixes this problem, but I have a problem: The documentation for charIsRepresentable claims that unrepresentable characters are replaced with ?, but the current code does not in fact do this - you get a nonsense character instead. Furthermore, it is difficult to fix the code to match the documentation in my new locale-aware implementation because iconv only provides transliteration and ignore modes for unrepresentable characters.

So there are two problems:

  1. The documented behaviour on unrepresentable characters does not match the implemented behaviour.
  2. The documented behaviour is difficult to implement.

So we should probably change the documented behaviour. The easiest thing to do is drop unrepresentable characters, which can be implemented easily either using our code page decoder (on Win32) or iconv (on *nix).

Does this sound like a reasonable approach?
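To make the difference between the two policies concrete, here is a small pure sketch that encodes a String to Latin-1 bytes and either drops or replaces characters outside that range. It is only an illustration: `OnUnrepresentable` and `encodeLatin1` are hypothetical names, Latin-1 is chosen because its representable range is trivial to test, and the actual patch relies on iconv or the Win32 code page decoder, not on code like this.

```haskell
import Data.Char (ord)
import Data.Word (Word8)

-- Hypothetical policy type: what to do with a character the target
-- encoding cannot represent.
data OnUnrepresentable
  = DropChar          -- the behaviour proposed above
  | ReplaceWith Word8 -- the behaviour the old documentation promised (with '?')

-- Encode to Latin-1, applying the policy to characters above U+00FF.
encodeLatin1 :: OnUnrepresentable -> String -> [Word8]
encodeLatin1 policy = concatMap encodeChar
  where
    encodeChar c
      | ord c < 256 = [fromIntegral (ord c)]
      | otherwise = case policy of
          DropChar      -> []
          ReplaceWith b -> [b]
```

Under `DropChar`, the string "a\955b" (with a Greek lambda in the middle) encodes to the bytes for "ab"; under `ReplaceWith 63` it encodes to the bytes for "a?b".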

follow-up: ↓ 9   Changed 2 years ago by simonmar

Are you planning to make peekCString and friends do decoding by default? I have a horrible feeling that will break lots of things. I know it's what the FFI spec requires, but since we've never done it, changing the behaviour now could be surprising.

I've no objection to your proposal for unrepresentable chars, provided we document it appropriately.

in reply to: ↑ 8 ; follow-up: ↓ 10   Changed 2 years ago by ross

Replying to simonmar:

Are you planning to make peekCString and friends do decoding by default? I have a horrible feeling that will break lots of things. I know it's what the FFI spec requires, but since we've never done it, changing the behaviour now could be surprising.

This behaviour has been specified by the FFI spec since 2002, and was incorporated into Haskell 2010. The documentation of the module has been promising this change since 2004, and in all that time the alternative CAString versions have been available, so it's probably not too hasty to implement it now. I fear you're right about breakage, but it has to happen some time.
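For code that depends on the old byte-per-character behaviour, the CAString variants mentioned above remain an option. The sketch below is a hedged example of that escape hatch; `peekBytesAsChars` is a hypothetical helper name, while `peekCAStringLen` is the existing function from Foreign.C.String.

```haskell
import Data.Word (Word8)
import Foreign.C.String (peekCAStringLen)
import Foreign.Marshal.Array (withArrayLen)
import Foreign.Ptr (castPtr)

-- Hypothetical helper: read a buffer with the CAString variant, which
-- maps each byte to the Char with the same code and ignores the locale.
peekBytesAsChars :: [Word8] -> IO String
peekBytesAsChars bytes =
  withArrayLen bytes $ \len ptr ->
    peekCAStringLen (castPtr ptr, len)
```

Code that only ever handled ASCII sees no difference between the two families; the results diverge only for bytes above 0x7F, where the locale-aware functions may combine several bytes into one character.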

in reply to: ↑ 9   Changed 2 years ago by simonmar

Replying to ross:

This behaviour has been specified by the FFI spec since 2002, and was incorporated into Haskell 2010. The documentation of the module has been promising this change since 2004, and in all that time the alternative CAString versions have been available, so it's probably not too hasty to implement it now. I fear you're right about breakage, but it has to happen some time.

Then I fear we will all need to brace for impact before the next major release :-)

  Changed 2 years ago by Athas

Has there been any further work on this issue? I'm willing to help out (with testing/hacking) if necessary.

  Changed 2 years ago by igloo

  • owner set to batterseapower
  • priority changed from normal to high

If we're going to do this, we should do it as soon as possible; as ross says, any breakage has to happen some time, and it's only going to get worse if we leave it. So I'll make it high priority for 7.2.1.

batterseapower, are you happy to take the lead on this?

  Changed 2 years ago by batterseapower

Yes, I am going to get my patches in - I was away in China for 2 weeks or I would have already moved this forward.

  Changed 2 years ago by batterseapower

  • status changed from new to closed
  • resolution set to fixed

Fixed by 509f28cc93b980d30aca37008cbe66c677a0d6f6 to base.
