Ticket #4471 (new bug)

Opened 3 years ago

Last modified 3 months ago

Incorrect Unicode output on Windows Console

Reported by: sankeld
Owned by:
Priority: low
Milestone: 7.6.2
Component: Compiler
Version: 6.12.3
Keywords:
Cc: ekmett@…, dagitj@…, simon@…, shelarcy@…
Operating System: Windows
Architecture: x86
Type of failure: Incorrect result at runtime
Difficulty:
Test Case:
Blocked By:
Blocking:
Related Tickets:

Description

To reproduce:

  • Start a Windows console.
  • Change the console's font to a TrueType Unicode font, such as "Lucida Console".
  • Type "chcp 65001" to set the console to the UTF-8 code page.

test.hs

main = putStrLn "∷⇒∀→←⋯⊢"

Output to the console is garbled. runghc test.hs:

∷⇒∀→←⋯⊢
→←⋯⊢
⋯⊢
∷⇒∀→←⋯⊢→←⋯⊢←⋯⊢⋯⊢⊢⊢⊢<stdout>: hFlush: permission denied (Permission denied)

Piping works correctly. runghc test.hs > output && type output:

∷⇒∀→←⋯⊢

GHCi fails as well. ghci test.hs:

GHCi, version 6.12.3: http://www.haskell.org/ghc/  :? for help
Loading package ghc-prim ... linking ... done.
Loading package integer-gmp ... linking ... done.
Loading package base ... linking ... done.
Loading package ffi-1.0 ... linking ... done.
[1 of 1] Compiling Main             ( test.hs, interpreted )
Ok, modules loaded: Main.
*Main> main
∷*** Exception: <stdout>: hPutChar: permission denied (Permission denied)
*Main>

Change History

  Changed 3 years ago by sankeld

A solution that doesn't require changing from the posix emulation layer is shown here: http://blogs.msdn.com/b/michkap/archive/2008/03/18/8306597.aspx

test.c

#include <fcntl.h>
#include <io.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    // this seems to fix the problem: put stdout into UTF-8 text mode
    _setmode(_fileno(stdout), _O_U8TEXT);
    char testStr[] = "∷⇒∀→←⋯⊢";
    // posix emulation: write the raw UTF-8 bytes to stdout
    write(STDOUT_FILENO, testStr, strlen(testStr));
    return 0;
}

gcc test.c -o test.exe && test.exe:

∷⇒∀→←⋯⊢

test.exe > output && type output

∷⇒∀→←⋯⊢

  Changed 3 years ago by simonmar

Surely that solution only works for UTF-8? What about other code pages?

follow-up: ↓ 6   Changed 3 years ago by sankeld

Change to the Greek code page with chcp 1253, then run test.exe:

∷⇒∀→←⋯⊢

_setmode then is a solution only when the console is set to the Unicode code page. This seems like an adequate solution for now, no?

Here is the link for the _setmode documentation:

 http://msdn.microsoft.com/en-us/library/tw4k6df8.aspx

in reply to: ↑ 5 ; follow-up: ↓ 8   Changed 3 years ago by simonmar

Replying to sankeld:

_setmode then is a solution only when the console is set to the Unicode code page. This seems like an adequate solution for now, no? Here is the link for the _setmode documentation:  http://msdn.microsoft.com/en-us/library/tw4k6df8.aspx

I don't like to apply a fix without fully understanding what the problem is and why the fix works, and this is all very mysterious to me right now. Why doesn't it work to send UTF-8 to stdout if the current code page is set to UTF-8?

  Changed 3 years ago by igloo

  • milestone set to 7.2.1

in reply to: ↑ 6   Changed 3 years ago by sankeld

Replying to simonmar:

Replying to sankeld:

_setmode then is a solution only when the console is set to the Unicode code page. This seems like an adequate solution for now, no? Here is the link for the _setmode documentation:  http://msdn.microsoft.com/en-us/library/tw4k6df8.aspx

I don't like to apply a fix without fully understanding what the problem is and why the fix works, and this is all very mysterious to me right now. Why doesn't it work to send UTF-8 to stdout if the current code page is set to UTF-8?

I understand your hesitation. I carefully read through the documentation linked there and on the blog post I mentioned. The only thing Microsoft is putting out right now is the "how" and not the "why" unfortunately.

I don't have high hopes we'll be able to get beyond speculation as to why the default console mode produces unexpected and unpredictable Unicode console output.

One thing we can note is that the mention of _O_U16TEXT, _O_U8TEXT, and _O_WTEXT in the _setmode documentation is a recent addition (as of Visual Studio 2010), although the flags worked before that. This may be an indicator that Microsoft is "blessing" this workaround for the console.

  Changed 3 years ago by simonmar

There are still too many unknowns here.

  • Won't _O_U8TEXT do newline mangling too? The IO library already does that, so we could have a problem.
  • the original report said that piping the output to a file worked fine. So presumably we need to do this only when the file descriptor is attached to a console?
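One way the "only when attached to a console" check could look is sketched below. This is a minimal sketch, not what GHC does: the helper name fd_is_console is made up for illustration, and it assumes that a successful GetConsoleMode call on the descriptor's OS handle is a good enough test on Windows (on other systems it falls back to isatty).

```c
#ifdef _WIN32
#include <io.h>
#include <windows.h>
#else
#include <unistd.h>
#endif

/* Hypothetical helper: returns 1 if fd appears to be attached to a
   console (or terminal), 0 if it is a file, a pipe, or invalid. */
static int fd_is_console(int fd)
{
#ifdef _WIN32
    /* Map the CRT descriptor to its Win32 handle. */
    HANDLE h = (HANDLE)_get_osfhandle(fd);
    DWORD mode;
    if (h == INVALID_HANDLE_VALUE)
        return 0;
    /* GetConsoleMode only succeeds for console handles. */
    return GetConsoleMode(h, &mode) != 0;
#else
    return isatty(fd) == 1;
#endif
}
```

With a check like this, any console-specific handling could be skipped when output is redirected, which is the case the original report says already works.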

And I still don't understand exactly what this _setmode is a workaround for. Something apparently goes wrong when you try to output Unicode to the console, but at what layer does the problem occur? (GHC.IO, msvcrt, Win32, kernel)

I don't like to be obstructive when there's an apparent fix for a problem, but I've seen many cases where a "fix" has introduced new problems, so I want to make sure the cure is not worse than the disease :)

  Changed 3 years ago by sankeld

I think I have the bug pinpointed and can explain the behavior of the original test program.

I've verified that the POSIX write call, when applied to stdout while stdout is attached to a console with code page 65001, returns the number of *characters* written instead of the number of *bytes*. This can probably be traced to this issue.

The reasoning for our original output

∷⇒∀→←⋯⊢ -- outputs correctly, but runtime thinks that 9/15 characters remain
→←⋯⊢ -- runtime tries to output the remaining characters, but still thinks characters remain.
⋯⊢ -- ...and so on until a buffer overrun I assume.
∷⇒∀→←⋯⊢→←⋯⊢←⋯⊢⋯⊢⊢⊢⊢<stdout>: hFlush: permission denied (Permission denied)

The source of the fdWrite function in GHC/IO/FD.hs confirms this behavior.
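The retry behavior described above can be reproduced with a small portable simulation. This is only a sketch under assumptions: mock_console_write is a made-up stand-in for the buggy console write (it counts UTF-8 characters instead of bytes), write_all is a generic write-everything loop rather than GHC's actual fdWrite, and the data reaching the console is assumed to be the test string followed by CR LF.

```c
#include <stddef.h>

/* Hypothetical stand-in for the buggy write: it consumes the whole
   buffer but reports the number of UTF-8 *characters*, not bytes. */
static size_t mock_console_write(const char *buf, size_t len)
{
    size_t chars = 0;
    for (size_t i = 0; i < len; i++) {
        /* Count every byte that is not a continuation byte (10xxxxxx). */
        if (((unsigned char)buf[i] & 0xC0) != 0x80)
            chars++;
    }
    return chars;
}

/* A conventional write-all loop: it trusts the return value as a byte
   count and resubmits whatever it believes is still unwritten.
   Records the starting byte offset of each call in offsets[] and
   returns the number of calls made (at most max_calls). */
static int write_all(const char *buf, size_t len,
                     size_t *offsets, int max_calls)
{
    size_t off = 0;
    int calls = 0;
    while (off < len && calls < max_calls) {
        offsets[calls++] = off;
        off += mock_console_write(buf + off, len - off);
    }
    return calls;
}
```

Under those assumptions the buffer is 23 bytes (seven 3-byte characters plus CR LF). The first call reports 9, so the second call starts at byte offset 9, which is "→", and the third starts at byte offset 15, which is "⋯". That matches the repeated tails in the first three lines of the garbled output.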

An ugly solution, if we want to work around this write bug, would be to check, upon write, if this is a 65001 console (not piped to a file). If so, treat the return value of write as a number of characters instead of a number of bytes.

Arg.

  Changed 3 years ago by sankeld

Also, looking at comments  here, there would also have to be a check of whether or not the console is using a ttf font. This workaround strategy is beginning to look like a dead end.

  Changed 2 years ago by simonmar

That does clarify things a lot, thanks for that. To summarise:

  • the bug is that Win32 WriteFile() returns the wrong result when writing to a Console in codepage 65001. Furthermore, the result it actually returns is the number of characters written to the console, which depends on the actual font being used! (if the font doesn't have the required Unicode glyph, it falls back to outputting characters corresponding to the raw UTF-8 bytes).

The only way to work around the bug seems to be to use WriteConsole() and write Unicode characters directly. If a Handle is attached to a console, then all writes must be decoded from the codepage encoding to UTF-16 before being written using WriteConsole(). Even better would be to bypass the codepage encoding entirely and encode directly from UTF-32 to UTF-16 in the IO library. None of this is particularly easy, though.
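The UTF-32 to UTF-16 step mentioned above is the mechanical part of this approach, and a minimal sketch of it follows. The function name utf32_to_utf16 is made up for illustration and is not part of GHC's IO library; the console write itself is left out so the code stays portable.

```c
#include <stddef.h>
#include <stdint.h>

/* Encode one Unicode code point as UTF-16.
   Writes 1 or 2 code units to out and returns how many were written.
   Returns 0 for surrogate values and for anything above U+10FFFF. */
static size_t utf32_to_utf16(uint32_t cp, uint16_t out[2])
{
    if (cp >= 0xD800 && cp <= 0xDFFF)
        return 0;                       /* lone surrogate: not encodable */
    if (cp > 0x10FFFF)
        return 0;                       /* outside the Unicode range */
    if (cp < 0x10000) {
        out[0] = (uint16_t)cp;          /* BMP: one code unit */
        return 1;
    }
    cp -= 0x10000;                      /* 20 bits split across a pair */
    out[0] = (uint16_t)(0xD800 + (cp >> 10));    /* high surrogate */
    out[1] = (uint16_t)(0xDC00 + (cp & 0x3FF));  /* low surrogate */
    return 2;
}
```

On Windows, the resulting code units would then presumably be handed to WriteConsoleW, which takes and reports its lengths in UTF-16 code units, so the byte-versus-character mismatch seen with WriteFile should not arise on that path.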

  Changed 16 months ago by igloo

  • priority changed from normal to low
  • milestone changed from 7.4.1 to 7.6.1

  Changed 8 months ago by igloo

  • milestone changed from 7.6.1 to 7.6.2

  Changed 4 months ago by ekmett

  • cc ekmett@… added

  Changed 3 months ago by dagit

  • cc dagitj@… added

  Changed 3 months ago by simonmic

  • cc simon@… added

  Changed 3 months ago by shelarcy

  • cc shelarcy@… added