Ticket #5255 (closed feature request: wontfix)

Opened 2 years ago

Last modified 23 months ago

String literals cause runtime crashes when OverloadedStrings is in effect

Reported by: YitzGale Owned by:
Priority: normal Milestone:
Component: Compiler Version: 7.0.3
Keywords: Cc: pho@…, hackage.haskell.org@…
Operating System: Unknown/Multiple Architecture: Unknown/Multiple
Type of failure: None/Unknown Difficulty:
Test Case: Blocked By:
Blocking: Related Tickets:

Description (last modified by simonpj) (diff)

There has been a discussion[1] on the web-devel list about the fate of the IsString instance for Name in the xml-types library[2]. A Name is the name of an XML element or attribute.

That instance calls error when the string contains a certain kind of invalid domain-specific syntax. Some are even advocating expanding this behavior to any string that is syntactically invalid for XML names.

So we now have GHC as the only major compiler which can cause *runtime* crashes depending on what characters are used in a string literal.

OverloadedStrings as a more general mechanism is very convenient in many settings. One of them is XML names; another is attoparsec-text[3] parsers. I must admit I have succumbed to the temptation of making this deal with the devil and benefiting from them.

But when used this way OverloadedStrings is really just another syntax for quasi-quotation, and that is what should have been used explicitly instead of these unsafe domain-specific IsString instances.

I propose fixing the problem in one of the following ways:

A. Make string literals syntax in fact a specialized quasi-quotation when OverloadedStrings is turned on. That way, exceptions are caught at compile time as they should be.

B. Bless Text, and possibly ByteString, as the only types that get magical behavior of string literals.

C. Remove OverloadedStrings altogether.

Option A is by far the nicest. But it requires GHC to know the type of the string literal before the cast is applied. We might also need some way to help GHC find the cast function at the right time, beyond just having an IsString instance somewhere in scope.

By submitting this bug, I am making it clear that I am opposed to Option D, leaving things the way they are and wishing everyone the best of luck. The OverloadedStrings pragma is not really optional anymore now that Text is becoming the default string type in practice for Haskell. It is not acceptable to have to wrap every string awkwardly with (T.pack "") and give up the chance of it being CAFfed. In fact, the blaze-html[4] library relies on OverloadedStrings for its performance[5].

I am also opposed, though less so, to providing a deprecation route by using a new language pragma for Option A or B. The current behavior is dangerous and should be summarily removed.

Change History

  Changed 2 years ago by YitzGale

One issue to point out here is that my approach does not allow polymorphic string literals.

We need a safe mechanism for string literals that includes both String and Text. If that mechanism can allow other types too, all the better, but it needs to be semantically sound.

If anyone strongly feels the need for runtime resolution of string literals, which allows polymorphic literals but also runtime crashes, that should be a separate mechanism.

It is interesting to note that this problem hasn't arisen in practice for numeric literals. I think it is a combination of the difficulty of writing a Num instance, the limited syntax of a numeric literal, and the intuitive semantics of a number. Whereas for string literals, it is just too tempting for people to abuse them as a poor man's quasi-quoter, thus polluting the semantic space of literals in Haskell.

follow-up: ↓ 9   Changed 2 years ago by duncan

Closely related is that for ByteString? it'd be nice to enforce that the chars are ASCII only. Currently it is possible to use unicode literals with ByteString? and the result you get is not defined (I think it will actually vary depending on -O0 vs -O1 due to RULES that bypass B.pack).

follow-up: ↓ 4   Changed 2 years ago by igloo

I think it's actually quite common to make bad Num instances, e.g. making a Seconds type an instance so that you can add Seconds together, but making (*) an error.

in reply to: ↑ 3 ; follow-up: ↓ 5   Changed 2 years ago by YitzGale

Replying to igloo:

I think it's actually quite common to make bad Num instances... making (*) an error.

True. That is a related but different problem. It is a problem with the design of the Num class. At least (*) is a function. It shouldn't fail at runtime either, but when it does, that is not as bad as a *literal* failing at runtime.

in reply to: ↑ 4   Changed 2 years ago by cdsmith

Replying to YitzGale:

True. That is a related but different problem. It is a problem with the design of the Num class. At least (*) is a function.

I'm not sure it's a different problem at all. In fact, it's very easy to produce a Num instance such that numeric literals cause runtime errors as well:

data Nat = Z | S Nat deriving (Show, Eq)
instance Num Nat where
    fromInteger n | n < 0     = error "Negative Nat"
                  | n == 0    = Z
                  | otherwise = S (fromInteger (n-1))

This is directly analogous to the issue you mention with strings.

follow-up: ↓ 10   Changed 2 years ago by jmillikin

I'm the author of the xml-types in question. The IsString instance was added at the request of Michael Snoyman, who wanted a syntactically-easy way to define element names. I agree with him that it is useful, but feel it should be regarded as trivial syntactic sugar in the vein of Data.ByteString.Char8.

It would be very useful for the compiler to support some kind of string literal validation, since there are currently several ways that bad literals can cause unexpected runtime failure:

1) ByteString? literals can be constructed with characters where ord(c) > 255

2) Text literals can be constructed with characters that are not valid Unicode.

3) Int, Word, and their variants can be constructed from literals that exceed their bounds.

IMO, at minimum throwing a runtime exception for these cases would be better than the current behavior, which is to return some arbitrary garbage value. Failing loudly and obviously saves much debugging time when the value is subtly wrong.

Using quasi-quoters are not a good solution, as they drag in the whole horrific mess that is Template Haskell. They're also not much better than simply using existing constructors:

-- I like this style
name :: Text -> Text -> Name
name namespace local = Name local (Just namespace) Nothing

someName :: Name
someName = name "foo" "bar"

-- The current IsString enables this style
someName :: Name
someName = "{foo}bar"

-- A quasi-quoter would look something like this, at best
someName :: Name
someName = $(qqName "{foo}bar")

(side note: Calling this a "crash" is silly, since nothing crashes, and the error is reported as a standard exception. The word "crash" should be reserved for actual crashes, such as peek nullPtr).

  Changed 2 years ago by simonpj

  • description modified (diff)

follow-up: ↓ 11   Changed 2 years ago by simonpj

  • description modified (diff)

I'm having trouble following the details of this discussion; some examples would help.

As I understand it, what you want is to be able to write the Haskell expression

 "<foo>mumble</blah>" :: XML

and have a compile-time error saying that the string is ill-formed. Is that right.

If so, the solution is to hand, in the form of quasi-quotation (http://www.haskell.org/ghc/docs/7.0-latest/html/users_guide/template-haskell.html#th-quasiquotation), Geoff Mainland's enhancement to Template Haskell. You say

  [xml| <foo>mumble</blah> |]

GHC runs the xml quasi-quoter (which you write, or provide in your library) and it can check the string. Moreover, the quasi-quoter monad hooks into GHC's error reporting machinery, so you can report errors just as if they came from GHC. Moreover, you can put these quasi-quotes in patterns, or types, or declarations.

This seems so easy and so natural that I'm not sure what else you might want. Maybe I am missing the point. I suppose you might not like the concrete syntax; improvements welcome. But somewhere you have to say what the parser is.

Simon

PS: Is Template Haskell really a "horrific mess"?

in reply to: ↑ 2   Changed 2 years ago by simonmar

Replying to duncan:

Closely related is that for ByteString? it'd be nice to enforce that the chars are ASCII only. Currently it is possible to use unicode literals with ByteString? and the result you get is not defined (I think it will actually vary depending on -O0 vs -O1 due to RULES that bypass B.pack).

Actually the RULEs for ByteString? literals won't apply to non-ASCII string literals, because GHC uses a different desugaring for these. For a detailed description of the current (sorry) state, see #5218.

in reply to: ↑ 6   Changed 2 years ago by simonmar

  • type changed from bug to feature request

Replying to jmillikin:

(side note: Calling this a "crash" is silly, since nothing crashes, and the error is reported as a standard exception. The word "crash" should be reserved for actual crashes, such as peek nullPtr).

Hear hear. I panicked when I saw the word "crash" in this bug report, only to find that GHC is working exactly as it is supposed to.

This ticket is not a bug. It's not really a feature request either, since there's no agreement yet on what changes should be made (if any), but I'm changing it to a feature request so the discussion can continue here if you like. It might be better to take this somewhere else though (e.g. ghc-users).

in reply to: ↑ 8   Changed 2 years ago by snoyberg

Replying to simonpj:

I'm having trouble following the details of this discussion; some examples would help. As I understand it, what you want is to be able to write the Haskell expression {{{ "<foo>mumble</blah>" :: XML }}} and have a compile-time error saying that the string is ill-formed. Is that right. If so, the solution is to hand, in the form of quasi-quotation (http://www.haskell.org/ghc/docs/7.0-latest/html/users_guide/template-haskell.html#th-quasiquotation), Geoff Mainland's enhancement to Template Haskell. You say {{{ [xml| <foo>mumble</blah> |] }}}

We're not using IsString? for the XML datatype, but rather for the Name datatype. An element or attribute name is composed of a local name, a namespace and a prefix, resulting in the datatype:

data Name = Name { local :: Text, namespace :: Maybe Text, prefix :: Maybe Text }

The idea of the current IsString? instance for Name is to be able to encode both the namespace and the local name simultaneously, via "clark notation." In other words, "foo" => Name "foo" Nothing Nothing, and "{bar}foo" => Name "foo" (Just "bar") Nothing. The problem arises when the programmer enters the string literal "{barfoo", which currently errors out.

The reason we'd like to avoid using a quasi-quoter is that it's simply much bigger syntax overhead than a string literal. My code base makes use of the IsString? instance all over the place, and having to turn "{bar}foo" into [name|{bar}foo|] everywhere would just be inconvenient.

  Changed 2 years ago by simonpj

OK, so this is largely a question of syntactic overhead.

I still think TH is the right mechanism, regardless of the syntax

  • TH is GHC's only mechanism for guaranteeing compile time evaluation
  • It hooks into the error reporting monad, so you get proper civilised error messages with decent line numbres. Something based on "evaluate (packBS "foo{name")" at compile time will not.
  • Adding interaction with overloading is even worse. What is supposed to happen for
    f :: IsString s => [s]
    f = ["foo", "bar{n"]
    
    You can't do this statically.

That said, I don't see a way to achieve what you want with zero syntactic overhead.

Still, if you can think of a quieter notation for quasiquoting, sing out.

  Changed 2 years ago by cdsmith

Just copying this from elsewhere. I'm not necessarily in favor of this idea, but I think it would be a valid, semantics-preserving transformation to have the compiler speculatively try to evaluate string (and numeric, for that matter) literals at compile time, and just give up if:

* The code ran for too long. * The code used unsafePerformIO or something similar.

At its root, this would be nothing more than an optimization -- evaluating a complex expression at compile time. But there could be a warning for cases when you gave up due to the computation taking too long or doing unsafePerformIO tricks, or where it's known that the literal is bottom.

I'm not saying I think we *should* do this. But I'm saying I think it would beat adding another order of magnitude of complexity to the semantics of literals by making them depend on TH and quasiquoting.

  Changed 2 years ago by PHO

  • cc pho@… added

  Changed 23 months ago by liyang

  • cc hackage.haskell.org@… added

  Changed 23 months ago by igloo

  • status changed from new to closed
  • resolution set to wontfix

I think the consensus is to do nothing, so I'll close this ticket.

Note: See TracTickets for help on using tickets.