Ticket #145 (closed defect: fixed)

Opened 6 years ago

Last modified 17 months ago

Hackage does not support UTF-8 characters

Reported by: guest
Owned by:
Priority: normal
Milestone: Cabal-1.4
Component: hackageDB website
Version: 1.1.6
Severity: normal
Keywords:
Cc:
Difficulty: easy (<4 hours)
GHC Version: 6.4.2
Platform:

Description

I checked a package with UTF-8 characters in my name (Marco Túlio), and it showed it as Marco TÃºlio, every time it appeared.

Change History

Changed 5 years ago by duncan

  • difficulty changed from normal to easy (<4 hours)
  • component changed from HackageDB website to Cabal library
  • platform Linux deleted
  • milestone set to Cabal-1.4

Changed 5 years ago by duncan

So to do this properly we have to consider unicode everywhere in the .cabal file parser. What is the best strategy?

Some fields want to be ASCII only, like package names, dependencies, etc. Others are totally free-form.

Probably the right thing to do is to read the .cabal file as UTF-8 before parsing. Then, for those fields that should be ASCII only, we should parse as we do now, do a check afterwards, and complain about characters we do not allow. That way we get better error messages.
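A minimal sketch of that strategy, using only base. The names readFileUtf8 and checkAsciiField are hypothetical and are not taken from the Cabal library; the check is meant to run on a field value after the normal parse.

```haskell
import Data.Char (isAscii)
import System.IO (IOMode (ReadMode), hGetContents, hSetEncoding, openFile, utf8)

-- | Read a file, decoding its bytes as UTF-8 regardless of the locale.
-- (Hypothetical helper, not the Cabal library's own function.)
readFileUtf8 :: FilePath -> IO String
readFileUtf8 path = do
  h <- openFile path ReadMode
  hSetEncoding h utf8
  hGetContents h  -- lazy read; the handle is closed once the contents are consumed

-- | Hypothetical post-parse check for a field that must stay ASCII only.
-- Returns the value unchanged, or an error message listing the offending characters.
checkAsciiField :: String -> String -> Either String String
checkAsciiField fieldName value =
  case filter (not . isAscii) value of
    []  -> Right value
    bad -> Left ("field '" ++ fieldName ++ "' contains non-ASCII characters: " ++ bad)
```

Free-form fields such as the author name would skip the check and keep whatever Unicode text was decoded.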

However, if we're decoding from UTF-8 then we have to re-encode when printing, e.g. in error messages and when writing files.
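The re-encoding side can be sketched with base's handle encodings. This assumes a GHC recent enough to provide hSetEncoding in System.IO; writeFileUtf8 and useUtf8ForErrors are hypothetical names for illustration.

```haskell
import System.IO

-- | Write text to a file, encoding it as UTF-8 regardless of the locale.
-- (Hypothetical helper for illustration.)
writeFileUtf8 :: FilePath -> String -> IO ()
writeFileUtf8 path contents =
  withFile path WriteMode $ \h -> do
    hSetEncoding h utf8
    hPutStr h contents

-- | Make messages printed to stderr use UTF-8 as well.
useUtf8ForErrors :: IO ()
useUtf8ForErrors = hSetEncoding stderr utf8
```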

Changed 5 years ago by duncan

  • component changed from Cabal library to HackageDB website

So the Cabal lib now does read .cabal files as UTF8. Is there anything left to do in the website code?

Changed 5 years ago by guest

/packages/archive/pkg-list.html is potentially broken because it's served as utf-8 (being static) when it's intended to be viewed as latin1. Dynamic pages are instead correctly served as latin1 because that's the default for runCGI. This derives from the behaviour of the HTML Char instance in the xhtml package: it expects Unicode code points and escapes chars >= 255 with a numerical escape. (Note that the first 256 code points of Unicode correspond to latin1.)

Hence it works fine as long as everything is served as latin1, so we should either set that in the webserver configuration for static pages or directly in the page itself.

Or is it maybe better to produce utf8 pages?
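The mismatch comes down to the same code point having different byte representations in the two encodings. The sketch below hand-rolls both encoders purely for illustration (it is not code from xhtml or cgi, and it does not reject surrogate code points):

```haskell
import Data.Bits (shiftR, (.&.), (.|.))
import Data.Char (chr, ord)
import Data.Word (Word8)

-- | Encode one code point as UTF-8 (illustrative only).
encodeUtf8Char :: Char -> [Word8]
encodeUtf8Char c
  | n < 0x80    = [fromIntegral n]
  | n < 0x800   = [0xC0 .|. hi 6, cont 0]
  | n < 0x10000 = [0xE0 .|. hi 12, cont 6, cont 0]
  | otherwise   = [0xF0 .|. hi 18, cont 12, cont 6, cont 0]
  where
    n = ord c
    -- leading bits of the code point, placed in the first byte
    hi s = fromIntegral (n `shiftR` s)
    -- continuation byte: 10xxxxxx carrying six bits of the code point
    cont s = 0x80 .|. fromIntegral ((n `shiftR` s) .&. 0x3F)

encodeUtf8 :: String -> [Word8]
encodeUtf8 = concatMap encodeUtf8Char

-- | Encode as latin1; Nothing if any character is outside 0..255.
encodeLatin1 :: String -> Maybe [Word8]
encodeLatin1 = mapM toByte
  where
    toByte c
      | ord c <= 0xFF = Just (fromIntegral (ord c))
      | otherwise     = Nothing

-- | Interpret every byte as the latin1 character with that code point.
decodeLatin1 :: [Word8] -> String
decodeLatin1 = map (chr . fromIntegral)
```

For "ú" (U+00FA), encodeLatin1 gives the single byte 0xFA while encodeUtf8 gives 0xC3 0xBA. A page containing the raw latin1 byte is therefore only displayed correctly when the declared charset is latin1, and decodeLatin1 applied to UTF-8 bytes yields two characters where one was intended.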

Changed 5 years ago by ross@…

We're constrained by the behaviour of Text.XHtml here. I've added a meta tag to declare the charset that it's using, but it might be safer for Text.XHtml to escape everything over 127. Alternatively, if we were to serve UTF-8, the cgi package would be the appropriate place to do that.

Changed 5 years ago by ross

  • status changed from new to closed
  • resolution set to fixed

Text.XHtml has been changed in darcs to escape everything over 127, and the scripts now use that version, so utf-8 is now fine.
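A rough model of why escaping everything over 127 makes the charset question moot. This is not the actual Text.XHtml source: escapeFrom and escapeNonAscii are hypothetical names, and the sketch leaves out the escaping of HTML special characters such as < and &.

```haskell
import Data.Char (ord)

-- | Replace every character whose code point is at or above the limit
-- with a numeric character reference; pass everything else through.
escapeFrom :: Int -> String -> String
escapeFrom limit = concatMap fixChar
  where
    fixChar c
      | ord c >= limit = "&#" ++ show (ord c) ++ ";"
      | otherwise      = [c]

-- | Escape everything over 127, so the output is pure ASCII.
escapeNonAscii :: String -> String
escapeNonAscii = escapeFrom 128
```

With this, "Marco Túlio" becomes "Marco T&#250;lio". Pure ASCII output has the same bytes in latin1 and UTF-8, so the page renders the same under either declared charset. By contrast, escapeFrom 255 (the threshold described earlier in this ticket) leaves "ú" as a raw character, which is why the older behaviour depended on the page being served as latin1.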

Changed 17 months ago by elga
