Ticket #3977 (new feature request)

Opened 3 years ago

Last modified 10 days ago

Support double-byte encodings (Chinese/Japanese/Korean) on Windows

Reported by: shelarcy Owned by: batterseapower
Priority: low Milestone: 7.8.1
Component: libraries/base Version: 6.13
Keywords: Cc: shelarcy@…
Operating System: Windows Architecture: Unknown/Multiple
Type of failure: Incorrect result at runtime Difficulty: Unknown
Test Case: Blocked By:
Blocking: Related Tickets: #5754

Description

localeEncoding uses the console code page for text file encoding/decoding in single-byte encoding environments on Windows. However, GHC.IO.Encoding.CodePage.Table currently has no double-byte encodings (Chinese/Japanese/Korean), which often causes problems in double-byte encoding environments.

I know we can work around the problem by using hSetEncoding with utf8 or another UTF-* encoding, but that is not a good solution.
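For reference, the workaround mentioned above looks roughly like the following. This is only a minimal sketch: the helper names readFileUtf8 and writeFileUtf8 are made up for illustration, and it only helps when the file really is UTF-8, not when it is in the system's double-byte code page.

```haskell
import System.IO

-- Hypothetical helper: read a file as UTF-8, ignoring localeEncoding.
readFileUtf8 :: FilePath -> IO String
readFileUtf8 path = do
  h <- openFile path ReadMode
  hSetEncoding h utf8
  contents <- hGetContents h
  -- hGetContents is lazy, so force the whole string before closing.
  length contents `seq` hClose h
  return contents

-- Hypothetical helper: write a file as UTF-8, ignoring localeEncoding.
writeFileUtf8 :: FilePath -> String -> IO ()
writeFileUtf8 path text = withFile path WriteMode $ \h -> do
  hSetEncoding h utf8
  hPutStr h text
```

Every handle has to be set up this way by hand, which is why it is only a workaround.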

According to the earlier Windows patch, GHC.IO.Encoding.CodePage.Table doesn't support double-byte encodings because Windows shared library support did not work at the time:

Currently we do not support double-byte encodings (Chinese/Japanese/Korean), since
including those codepages would increase the table size to 400KB.  It will be
straightforward to implement them once the work on library DLLs is finished.

I think Windows shared library support works now, because #3879 is closed.

So, how about adding support for double-byte encodings (Chinese/Japanese/Korean) on Windows?

Change History

  Changed 3 years ago by shelarcy

  • version changed from 6.12.1 to 6.13

  Changed 3 years ago by igloo

  • milestone set to 6.14.1

  Changed 2 years ago by igloo

  • milestone changed from 7.0.1 to 7.0.2

  Changed 2 years ago by igloo

  • milestone changed from 7.0.2 to 7.2.1

  Changed 2 years ago by batterseapower

We could fall back on MultiByteToWideChar for code pages that CodePage.hs doesn't know about. This won't bloat the binaries at all, though it can involve more overhead than our native-Haskell CodePage decoder if the character size is not 2 bytes (i.e. something other than UTF-16 encoded Strings).

  Changed 2 years ago by simonmar

I don't know how to use MultiByteToWideChar to implement TextEncoding, other than doing a binary chop when it fails. This is why we had to make our own CodePage implementations. Some more details here:

 http://www.haskell.org/pipermail/libraries/2009-July/012077.html
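The comment does not spell out the binary chop. One plausible reading is: the decoder only reports success or failure for a whole buffer, so to fill an output buffer of limited size you search for the longest input prefix that still decodes and fits. The sketch below illustrates that idea on a made-up two-byte encoding. decodeCount is a toy stand-in for an all-or-nothing call such as MultiByteToWideChar, not a binding to it, and all names here are hypothetical.

```haskell
import Data.Word (Word8)

-- Toy all-or-nothing decoder: returns the number of characters in the
-- buffer, or Nothing if the buffer ends in the middle of a character.
-- Assumption: bytes >= 0x80 are lead bytes of a two-byte character and
-- any byte is a valid trail byte (not a real code page).
decodeCount :: [Word8] -> Maybe Int
decodeCount [] = Just 0
decodeCount (b:bs)
  | b < 0x80  = (+ 1) <$> decodeCount bs
  | otherwise = case bs of
      []       -> Nothing
      (_:rest) -> (+ 1) <$> decodeCount rest

-- Largest character boundary <= n, paired with its character count.
snap :: [Word8] -> Int -> (Int, Int)
snap bytes n = case decodeCount (take n bytes) of
  Just c  -> (n, c)
  Nothing -> snap bytes (n - 1)

-- Binary chop over prefix lengths: length in bytes of the longest prefix
-- that ends on a character boundary and decodes to at most cap characters.
longestFittingPrefix :: Int -> [Word8] -> Int
longestFittingPrefix cap bytes = fst (snap bytes (go 0 (length bytes)))
  where
    go lo hi
      | lo >= hi                    = lo
      | snd (snap bytes mid) <= cap = go mid hi
      | otherwise                   = go lo (mid - 1)
      where mid = (lo + hi + 1) `div` 2
```

Each probe decodes the prefix again from the start, which is where the extra cost comes from. Invalid input and error reporting are not handled here at all.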

follow-up: ↓ 8   Changed 2 years ago by batterseapower

I see. Still, having a dog-slow binary chop for these double-byte encodings is better than having no support at all.

in reply to: ↑ 7   Changed 2 years ago by simonmar

Replying to batterseapower:

I see. Still, having a dog-slow binary chop for these double-byte encodings is better than having no support at all.

Maybe. It would be tricky to get working though, and getting the error semantics right might be a pain (if it's possible at all). Are double-byte encodings widely used?

  Changed 2 years ago by batterseapower

IME it is very common for PCs in China to be using one of the double-byte code page settings, since there is still a ton of software out there that uses the legacy Windows APIs. I imagine the situation is the same in Japan/Korea but I have no direct experience.

Of course, even if DBCS is commonly used, localeEncoding is not as useful on Windows as it is on *nix in the context of GHC, since we will mostly just call into the *W APIs and thus sidestep the locale entirely... which probably makes this ticket a low priority. In particular, my upcoming patch set to implement PEP 383 behaviour should support CJK without having a double-byte-aware localeEncoding.

  Changed 21 months ago by shelarcy

  • cc shelarcy@… removed

  Changed 21 months ago by shelarcy

  • cc shelarcy@… added

Oops, I performed the wrong operation. Sorry for the noise.

  Changed 20 months ago by igloo

  • milestone changed from 7.2.1 to 7.4.1

  Changed 17 months ago by shelarcy

  • related set to #5754

  Changed 15 months ago by igloo

  • priority changed from normal to low
  • milestone changed from 7.4.1 to 7.6.1

  Changed 8 months ago by igloo

  • milestone changed from 7.6.1 to 7.6.2

  Changed 4 weeks ago by batterseapower

  • owner set to batterseapower

Fixed on the dbcs branch of the base library.

  Changed 10 days ago by igloo

  • difficulty set to Unknown
  • milestone changed from 7.6.2 to 7.8.1

Is this now fixed in HEAD by this commit?

commit f982978eaa5d74c5dffe71c14a1555587b6a5e48
Author: Max Bolingbroke <batterseapower@hotmail.com>
Date:   Thu Apr 18 21:29:08 2013 +0100

    Support for Windows DBCS and new SBCS with MultiByteToWideChar