Ticket #5218 (new feature request)

Opened 2 years ago

Last modified 2 months ago

Add unpackCStringLen# to create Strings from string literals

Reported by: tibbe Owned by: igloo
Priority: high Milestone: 7.8.1
Component: Compiler Version: 7.0.3
Keywords: Cc: johan.tibell@…, dons, dcoutts, pho@…, lykahb@…, reiner.pope@…, alexey.skladnoy@…, wren@…, patrick@…, hackage.haskell.org@…
Operating System: Unknown/Multiple Architecture: Unknown/Multiple
Type of failure: None/Unknown Difficulty: Unknown
Test Case: Blocked By:
Blocking: Related Tickets: 5877

Description

GHC inserts calls to unpackCString# to convert string literals to Strings. Libraries like bytestring use rewrite rules that match on this call to optimize code like pack (unpackCString# s).

If GHC instead used a variant of unpackCString#, say unpackCStringLen#, that also takes the (statically known) length, creating ByteStrings from literals could be a constant-time operation instead of a linear-time one.
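To illustrate the difference in cost, here is a minimal sketch using two functions that already exist in Data.ByteString.Unsafe. It is not the proposed unpackCStringLen# itself; the names greetingScanned and greetingWithLen are made up for this example, and the length 5 is supplied by hand where the proposal would have GHC supply it.

```haskell
{-# LANGUAGE MagicHash #-}
import qualified Data.ByteString as B
import Data.ByteString.Unsafe (unsafePackAddress, unsafePackAddressLen)

-- Linear time: the library has to scan the literal for its terminating NUL
-- before it knows how long the ByteString is.
greetingScanned :: IO B.ByteString
greetingScanned = unsafePackAddress "hello"#

-- Constant time: the caller provides the length, so no scan is needed.
-- Here the 5 is written by hand (hypothetical stand-in for a
-- compiler-provided length).
greetingWithLen :: IO B.ByteString
greetingWithLen = unsafePackAddressLen 5 "hello"#
```

Both actions produce the same ByteString; only the work done to find the length differs.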

Another use case, which motivated this ticket, is appending string literals to builders (e.g. using Data.Binary.Builder.fromByteString). For small strings the most efficient way to append a string to the builder is to copy the statically allocated string directly into the builder's output buffer. If the string length was known statically, we could do this efficiently using memcpy or even using a small unrolled loop.
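A rough sketch of the builder case, assuming the length is available: the literal's bytes are copied straight from static memory into a buffer with copyBytes (a memcpy-style copy from base). The scratch buffer from allocaBytes only stands in for a builder's output buffer, and copyLiteral and exampleCopy are hypothetical names, not part of any builder API.

```haskell
{-# LANGUAGE MagicHash #-}
import Data.Word (Word8)
import Foreign.Marshal.Alloc (allocaBytes)
import Foreign.Marshal.Array (peekArray)
import Foreign.Marshal.Utils (copyBytes)
import GHC.Exts (Addr#)
import GHC.Ptr (Ptr (..))

-- Copy len bytes of a string literal into a temporary buffer and read them
-- back. The temporary buffer is a stand-in for a builder's output buffer.
copyLiteral :: Addr# -> Int -> IO [Word8]
copyLiteral addr len =
  allocaBytes len $ \buf -> do
    copyBytes buf (Ptr addr) len -- destination, source, byte count
    peekArray len buf

-- Example call; the length 3 is written by hand here.
exampleCopy :: IO [Word8]
exampleCopy = copyLiteral "abc"# 3
```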

Change History

Changed 2 years ago by tibbe

  • cc johan.tibell@… added

Changed 2 years ago by simonmar

  • priority changed from normal to high
  • milestone set to 7.4.1

We really should do this.

Changed 2 years ago by simonmar

Simon and I discussed this just now and found two further problems in addition to the extra runtime length operation described above.

1. The RULES in ByteString for optimising literal strings currently do not work

e.g. try

{-# LANGUAGE OverloadedStrings #-}
module Foo where
import Data.ByteString.Char8
foo = "abc" :: ByteString

the generated code does the translation to/from String at runtime. Presumably this broke at some point, possibly due to things being inlined at the wrong time and/or let-floating.

2. The RULES in ByteString never worked for non-ASCII strings

The RULE in Data.ByteString.Char8 looks like

{-# RULES
"ByteString pack/packAddress" forall s .
   pack (unpackCString# s) = inlinePerformIO (B.unsafePackAddress s)
 #-}

But non-ASCII strings are UTF-8-encoded by GHC and wrapped in unpackCStringUtf8, not unpackCString, so the RULE won't apply. Example:

foo = "\0\0" :: ByteString

This will generate something like

foo = pack (unpackCStringUtf8 "\0\0"#)

We have two solutions.

1. Use quasi-quoting to declare ByteString literals.

You would write something like

foo :: ByteString
foo = [bytes| ffa3c9db77 |]

and the code for bytes would expand thus

  [bytes| ffa3c9db77 |]
===>
  inlinePerformIO (unsafePackAddress "..."#)

i.e. a call to unsafePackAddress passing a primitive literal string (HsStringPrim, aka "..."#). Note that primitive literal strings are not UTF-8-encoded by GHC; each character is just truncated to 8 bits (this happens in the lexer, of all places). No RULES are involved at all.

This would be more convenient for encoding static data into the program through ByteString literals.
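A minimal sketch of what such a quasi-quoter could look like, under some assumptions: the input is taken to be pairs of hex digits, the template-haskell library in use has StringPrimL carrying [Word8], and the expansion uses unsafePerformIO with unsafePackAddressLen rather than the inlinePerformIO (unsafePackAddress ...) form shown above, so that the length is passed explicitly. The helper names hexBytes, bytesExp and unsupported are invented for this example.

```haskell
{-# LANGUAGE TemplateHaskell #-}
import Data.ByteString.Unsafe (unsafePackAddressLen)
import Data.Char (digitToInt, isHexDigit, isSpace)
import Data.Word (Word8)
import Language.Haskell.TH (Exp, Q, litE, stringPrimL)
import Language.Haskell.TH.Quote (QuasiQuoter (..))
import System.IO.Unsafe (unsafePerformIO)

-- Decode pairs of hex digits into bytes, ignoring whitespace.
hexBytes :: String -> Either String [Word8]
hexBytes = go . filter (not . isSpace)
  where
    go [] = Right []
    go (a : b : rest)
      | isHexDigit a && isHexDigit b =
          (fromIntegral (16 * digitToInt a + digitToInt b) :) <$> go rest
    go _ = Left "bytes: expected an even number of hex digits"

-- Expand the quoted text to a call that packs a primitive string literal
-- together with its statically known length.
bytesExp :: String -> Q Exp
bytesExp src = case hexBytes src of
  Left err -> fail err
  Right ws ->
    let n = length ws
     in [|unsafePerformIO (unsafePackAddressLen n $(litE (stringPrimL ws)))|]

-- The quoter only makes sense in expression position.
unsupported :: String -> String -> Q a
unsupported ctx _ = fail ("bytes: not usable in a " ++ ctx ++ " context")

bytes :: QuasiQuoter
bytes =
  QuasiQuoter
    { quoteExp = bytesExp
    , quotePat = unsupported "pattern"
    , quoteType = unsupported "type"
    , quoteDec = unsupported "declaration"
    }
```

Because of the stage restriction, bytes has to be used from a different module than the one defining it, with QuasiQuotes enabled, e.g. foo = [bytes| ffa3c9db77 |].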

2. Fix the RULES for ByteString literals.

Here's one way we could fix all the problems described above. It's a bit tricky and requires changes in various places.

First, somewhere in base or ghc-prim we have

newtype String8 = String8 String

string8 :: String -> String8
{-# INLINE [0] string8 #-}
string8 = String8

unpackCString :: Addr# -> String
unpackCStringUtf8 :: Addr# -> String

unpackCString8Len :: Addr# -> Int# -> String8

Next, GHC internally has RULEs that convert

  string8 (unpackCStringUtf8 <utf8stringlit#>) 
    ===> unpackCString8Len <stringlit#> <len#>

  string8 (unpackCString <asciistringlit#>)
    ===> unpackCString8Len <asciistringlit#> <len#>

where <stringlit#> is made by decoding <utf8stringlit#> and stripping out bits 8 and above from each character.
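The "stripping out bits 8 and above" step can be sketched on ordinary Strings as below. This only shows the per-character truncation; it says nothing about how GHC would implement the built-in rule, and truncate8 is a made-up name.

```haskell
import Data.Char (ord)
import Data.Word (Word8)

-- Keep only the low 8 bits of each character's code point.
-- fromIntegral from Int to Word8 wraps modulo 256, which is exactly the
-- truncation described above.
truncate8 :: String -> [Word8]
truncate8 = map (fromIntegral . ord)
```

For example, 'é' (U+00E9) stays 0xE9, while 'λ' (U+03BB) loses its high bits and becomes 0xBB.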

Then in Data.ByteString.Char8 we have:

module Data.ByteString.Char8 where

packAddressLen :: Addr# -> Int# -> ByteString

pack8 :: String8 -> ByteString

instance IsString ByteString where
  fromString = pack8 . string8

{-# RULES "pack8/unpackCString8Len" forall addr len .
  pack8 (unpackCString8Len addr len) = packAddressLen addr len
 #-}

Changed 2 years ago by simonmar

  • cc dons, dcoutts added

Changed 2 years ago by tibbe

If solution two works I'm for it. It feels a bit complex but perhaps there's no better option. I'm against solution 1 as it would make most Haskell programs Template Haskell programs, as byte string literals are quite common, and that feels a bit too much to solve this problem.

If we could have proper byte string literals (à la Python's b"...") that would be even better, but I guess that would be too invasive a change.

Changed 2 years ago by simonpj

What's wrong with Template Haskell? The solution is so simple, it seems a shame not to use it, no?

Changed 2 years ago by simonmar

I think we want to do both - the two solutions are complementary.

For large ByteString literals, such as when you've serialised and gzipped a data structure for unpacking at runtime, the quasiquotation syntax makes perfect sense.

However, some people want to write ByteString literals using string syntax, and to not have to use TH just for one of these small literals (TH is seen as a heavyweight dependency if you haven't already bought in).

Changed 2 years ago by duncan

Solution 2 looks good to me.

As tibbe says, solution 1 would also be useful in other use cases. Hex, octal or bit string literals.

And if we ever switch Text to use UTF8 then unpackCStringUtf8Len# would be useful there too.

Changed 19 months ago by PHO

  • cc pho@… added

Changed 18 months ago by boris

  • cc lykahb@… added

Changed 18 months ago by igloo

  • milestone changed from 7.4.1 to 7.6.1

punting

Changed 18 months ago by PHO

I want both solutions too, especially solution 1.

What concerns me is that there seems to be no means of creating primitive byte-array literals with TH. That is, the Lit type currently only has a constructor StringPrimL String, which represents an Addr# literal encoded in UTF-8. Thus unsafePackAddressLen 3 "\NUL\NUL\NUL"# works but unsafePackAddressLen 3 $(litE $ StringPrimL "\NUL\NUL\NUL") doesn't. So we probably need to change the type of StringPrimL:

data Lit = CharL Char
         | StringL String
         | ...
         | StringPrimL [Word8] -- Raw, non-encoded "..."# literal.

Changed 17 months ago by reinerp

  • cc reiner.pope@… added

Changed 15 months ago by reinerp

  • related set to 5877

See also #5877 which fixed the Template Haskell point above
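For illustration, assuming a template-haskell version in which StringPrimL carries [Word8] (the shape proposed above), the failing example can be written as in the sketch below. The name threeNuls is made up, and unsafePerformIO is used only to get a top-level value.

```haskell
{-# LANGUAGE TemplateHaskell #-}
import qualified Data.ByteString as B
import Data.ByteString.Unsafe (unsafePackAddressLen)
import Language.Haskell.TH (litE, stringPrimL)
import System.IO.Unsafe (unsafePerformIO)

-- Three raw NUL bytes, spliced in as a primitive "..."# literal.
-- The length 3 is passed explicitly, so the embedded NULs are not
-- mistaken for a terminator.
threeNuls :: B.ByteString
threeNuls = unsafePerformIO (unsafePackAddressLen 3 $(litE (stringPrimL [0, 0, 0])))
{-# NOINLINE threeNuls #-}
```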

Changed 15 months ago by simonpj

  • owner set to igloo
  • difficulty set to Unknown

Assigning to Ian to investigate/propose

Changed 14 months ago by Khudyakov

  • cc alexey.skladnoy@… added

Changed 11 months ago by igloo

  • milestone changed from 7.6.1 to 7.8.1

Punting

Changed 8 months ago by WrenThornton

  • cc wren@… added

Changed 8 months ago by parcs

  • cc patrick@… added

Changed 7 months ago by simonpj

Changed 2 months ago by liyang

  • cc hackage.haskell.org@… added