core-text-0.3.8.0: A rope type based on a finger tree over UTF-8 fragments
Safe HaskellNone
LanguageHaskell2010

Core.Text.Rope

Description

If you're accustomed to working with text in almost any other programming language, you'd be aware that a "string" typically refers to an in-memory array of characters. Traditionally this was a single ASCII byte per character; more recently UTF-8 variable byte encodings which dramatically complicates finding offsets but which gives efficient support for the entire Unicode character space. In Haskell, the original text type, String, is implemented as a list of Char which, because a Haskell list is implemented as a linked-list of boxed values, is wildly inefficient at any kind of scale.

In modern Haskell there are two primary ways to represent text.

First is via the [rather poorly named] ByteString from the bytestring package (which is an array of bytes in pinned memory). The Data.ByteString.Char8 submodule gives you ways to manipulate those arrays as if they were ASCII characters. Confusingly there are both strict (Data.ByteString) and lazy (Data.ByteString.Lazy) variants which are often hard to tell the difference between when reading function signatures or haddock documentation. The performance problem an immutable array backed data type runs into is that appending a character (that is, ASCII byte) or concatonating a string (that is, another array of ASCII bytes) is very expensive and requires allocating a new larger array and copying the whole thing into it. This led to the development of "builders" which amortize this reallocation cost over time, but it can be cumbersome to switch between Builder, the lazy ByteString that results, and then having to inevitably convert to a strict ByteString because that's what the next function in your sequence requires.

The second way is through the opaque Text type of Data.Text from the text package, which is well tuned and high-performing but suffers from the same design; it is likewise backed by arrays. (Historically, the storage backing Text objects was encoded in UTF-16, meaning every time you wanted to work with unicode characters that came in from anywhere else and which inevitably were UTF-8 encoded they had to be converted to UTF-16 and copied into a further new array! Fortunately Haskell has recently adopted a UTF-8 backed Text type, reducing this overhead. The challenge of appending pinned allocations remains, however.)

In this package we introduce Rope, a text type backed by the 2-3 FingerTree data structure from the fingertree package. This is not an uncommon solution in many languages as finger trees support exceptionally efficient appending to either end and good performance inserting anywhere else (you often find them as the backing data type underneath text editors for this reason). Rather than Char the pieces of the rope are ShortText from the text-short package, which are UTF-8 encoded and in normal memory managed by the Haskell runtime. Conversion from other Haskell text types is not O(1) (UTF-8 validity must be checked, or UTF-16 decoded, or...), but in our benchmarking the performance has been comparable to the established types and you may find the resultant interface for combining chunks is comparable to using a Builder, without being forced to use a Builder.

Rope is used as the text type throughout this library. If you use the functions within this package (rather than converting to other text types) operations are quite efficient. When you do need to convert to another type you can use fromRope or intoRope from the Textual typeclass.

Note that we haven't tried to cover the entire gamut of operations or customary convenience functions you would find in the other libraries; so far Rope is concentrated on aiding interoperation, being good at appending (lots of) small pieces, and then efficiently taking the resultant text object out to a file handle, be that the terminal console, a file, or a network socket.

Synopsis

Rope type

data Rope Source #

A type for textual data. A rope is text backed by a tree data structure, rather than a single large continguous array, as is the case for strings.

There are three use cases:

Referencing externally sourced data

Often we interpret large blocks of data sourced from external systems as text. Ideally we would hold onto this without copying the memory, but (as in the case of ByteString which is the most common source of data) before we can treat it as text we have to validate the UTF-8 content. Safety first. We also copy it out of pinned memory, allowing the Haskell runtime to manage the storage.

Interoperating with other libraries

The only constant of the Haskell universe is that you won't have the right combination of {strict, lazy} × {Text, ByteString, String, [Word8], etc} you need for the next function call. The Textual typeclass provides for moving between different text representations. To convert between Rope and something else use fromRope; to construct a Rope from textual content in another type use intoRope.

You can get at the underlying finger tree with the unRope function.

Assembling text to go out

This involves considerable appending of data, very very occaisionally inserting it. Often the pieces are tiny. To add text to a Rope use the appendRope method as below or the (<>) operator from Data.Monoid (like you would have with a Builder).

Output to a Handle can be done efficiently with hWrite.

Instances

Instances details
Eq Rope Source # 
Instance details

Defined in Core.Text.Rope

Methods

(==) :: Rope -> Rope -> Bool #

(/=) :: Rope -> Rope -> Bool #

Ord Rope Source # 
Instance details

Defined in Core.Text.Rope

Methods

compare :: Rope -> Rope -> Ordering #

(<) :: Rope -> Rope -> Bool #

(<=) :: Rope -> Rope -> Bool #

(>) :: Rope -> Rope -> Bool #

(>=) :: Rope -> Rope -> Bool #

max :: Rope -> Rope -> Rope #

min :: Rope -> Rope -> Rope #

Show Rope Source # 
Instance details

Defined in Core.Text.Rope

Methods

showsPrec :: Int -> Rope -> ShowS #

show :: Rope -> String #

showList :: [Rope] -> ShowS #

IsString Rope Source # 
Instance details

Defined in Core.Text.Rope

Methods

fromString :: String -> Rope #

Generic Rope Source # 
Instance details

Defined in Core.Text.Rope

Associated Types

type Rep Rope :: Type -> Type #

Methods

from :: Rope -> Rep Rope x #

to :: Rep Rope x -> Rope #

Semigroup Rope Source # 
Instance details

Defined in Core.Text.Rope

Methods

(<>) :: Rope -> Rope -> Rope #

sconcat :: NonEmpty Rope -> Rope #

stimes :: Integral b => b -> Rope -> Rope #

Monoid Rope Source # 
Instance details

Defined in Core.Text.Rope

Methods

mempty :: Rope #

mappend :: Rope -> Rope -> Rope #

mconcat :: [Rope] -> Rope #

NFData Rope Source # 
Instance details

Defined in Core.Text.Rope

Methods

rnf :: Rope -> () #

Hashable Rope Source # 
Instance details

Defined in Core.Text.Rope

Methods

hashWithSalt :: Int -> Rope -> Int #

hash :: Rope -> Int #

Pretty Rope Source # 
Instance details

Defined in Core.Text.Rope

Methods

pretty :: Rope -> Doc ann #

prettyList :: [Rope] -> Doc ann #

Binary Rope Source # 
Instance details

Defined in Core.Text.Rope

Textual Rope Source # 
Instance details

Defined in Core.Text.Rope

Render Rope Source # 
Instance details

Defined in Core.Text.Utilities

Associated Types

type Token Rope Source #

type Rep Rope Source # 
Instance details

Defined in Core.Text.Rope

type Rep Rope = D1 ('MetaData "Rope" "Core.Text.Rope" "core-text-0.3.8.0-I6LlNWNPeNi7cc8wYGtU2h" 'True) (C1 ('MetaCons "Rope" 'PrefixI 'False) (S1 ('MetaSel ('Nothing :: Maybe Symbol) 'NoSourceUnpackedness 'NoSourceStrictness 'DecidedLazy) (Rec0 (FingerTree Width ShortText))))
type Token Rope Source # 
Instance details

Defined in Core.Text.Utilities

type Token Rope = ()

emptyRope :: Rope Source #

A zero-length Rope. You can also use "" presuming the OverloadedStrings language extension is turned on in your source file.

singletonRope :: Char -> Rope Source #

A Rope with but a single character.

packRope :: String -> Rope Source #

A Rope built from a list of characters. Equivalent to calling intoRope on the String, but can help you avoid ambiguious type errors when composing functions or working with literals.

Since: 0.3.4

replicateRope :: Int -> Rope -> Rope Source #

Repeat the input Rope n times. The follows the same semantics as other replicate functions; if you ask for zero copies you'll get an empty text and if you ask for lots of "" you'll get ... an empty text.

Implementation note

Rather than copying the input n times, this will simply add structure to hold n references to the provided input text.

replicateChar :: Int -> Char -> Rope Source #

Repeat the input Char n times. This is a special case of replicateRope above.

Implementation note

Rather than making a huge FingerTree full of single characters, this function will allocate a single ShortText comprised of the repeated input character.

widthRope :: Rope -> Int Source #

Get the length of this text, in characters.

unconsRope :: Rope -> Maybe (Char, Rope) Source #

Read the first character from a Rope, assuming it's length 1 or greater, returning Just that character and the remainder of the text. Returns Nothing if the input is 0 length.

Since: 0.3.7

splitRope :: Int -> Rope -> (Rope, Rope) Source #

Break the text into two pieces at the specified offset.

Examples:

λ> splitRope 0 "abcdef"
("", "abcdef")
λ> splitRope 3 "abcdef"
("abc", "def")
λ> splitRope 6 "abcdef"
("abcdef","")

Going off either end behaves sensibly:

λ> splitRope 7 "abcdef"
("abcdef","")
λ> splitRope (-1) "abcdef"
("", "abcdef")

takeRope :: Int -> Rope -> Rope Source #

Take the first _n_ characters from the beginning of the Rope.

λ> takeRope 3 "123456789"
"123"

insertRope :: Int -> Rope -> Rope -> Rope Source #

Insert a new piece of text into an existing Rope at the specified offset.

Examples:

λ> insertRope 3 "Con" "Def 1"
"DefCon 1"
λ> insertRope 0 "United " "Nations"
"United Nations"

containsCharacter :: Char -> Rope -> Bool Source #

Does the text contain this character?

We've used it to ask whether there are newlines present in a Rope, for example:

    if containsCharacter '\n' text
        then handleComplexCase
        else keepItSimple

Interoperation and Output

class Textual α where Source #

Machinery to interpret a type as containing valid Unicode that can be represented as a Rope object.

Implementation notes

Given that Rope is backed by a finger tree, appendRope is relatively inexpensive, plus whatever the cost of conversion is. There is a subtle trap, however: if adding small fragments of that were obtained by slicing (for example) a large ByteString we would end up holding on to a reference to the entire underlying block of memory. This module is optimized to reduce heap fragmentation by letting the Haskell runtime and garbage collector manage the memory, so instances are expected to copy these substrings out of pinned memory.

The ByteString instance requires that its content be valid UTF-8. If not an empty Rope will be returned.

Several of the fromRope implementations are expensive and involve a lot of intermediate allocation and copying. If you're ultimately writing to a handle prefer hWrite which will write directly to the output buffer.

Minimal complete definition

fromRope, intoRope

Methods

fromRope :: Rope -> α Source #

Convert a Rope into another text-like type.

intoRope :: α -> Rope Source #

Take another text-like type and convert it to a Rope.

appendRope :: α -> Rope -> Rope Source #

Append some text to this Rope. The default implementation is basically a convenience wrapper around calling intoRope and mappending it to your text (which will work just fine, but for some types more efficient implementations are possible).

Instances

Instances details
Textual ByteString Source #

from Data.ByteString.Lazy

Instance details

Defined in Core.Text.Rope

Textual ByteString Source #

from Data.ByteString Strict

Instance details

Defined in Core.Text.Rope

Textual Builder Source #

from Data.ByteString.Builder

Instance details

Defined in Core.Text.Rope

Textual Text Source #

from Data.Text.Lazy

Instance details

Defined in Core.Text.Rope

Textual Text Source #

from Data.Text Strict

Instance details

Defined in Core.Text.Rope

Textual ShortText Source #

from Data.Text.Short

Instance details

Defined in Core.Text.Rope

Textual Bytes Source # 
Instance details

Defined in Core.Text.Rope

Textual Rope Source # 
Instance details

Defined in Core.Text.Rope

Textual [Char] Source #

from Data.String

Instance details

Defined in Core.Text.Rope

Textual (FingerTree Width ShortText) Source # 
Instance details

Defined in Core.Text.Rope

hWrite :: Handle -> Rope -> IO () Source #

Write the Rope to the given Handle.

import Core.Text
import Core.System -- re-exports stdout

main :: IO ()
main =
  let text :: Rope
      text = "Hello World"
   in hWrite stdout text

because it's tradition.

Uses hPutBuilder internally which saves all kinds of intermediate allocation and copying because we can go from the ShortTexts in the finger tree to ShortByteString to Builder to the Handle's output buffer in one go.

If you're working in the core-program Program τ monad, then the write function there provides an efficient way to write a Rope to stdout.

Internals

unRope :: Rope -> FingerTree Width ShortText Source #

Access the finger tree underlying the Rope. You'll want the following imports:

import qualified Data.FingerTree as F  -- from the fingertree package
import qualified Data.Text.Short as S  -- from the text-short package

unsafeIntoRope :: ByteString -> Rope Source #

If you know the input bytes are valid UTF-8 encoded characters, then you can use this function to convert to a piece of Rope.

copyRope :: Rope -> Rope Source #

Copy the pieces underlying a Rope into a single piece object.

Warning

This function was necessary to have a reliable Hashable instance. Currently constructing this new Rope is quite inefficient if the number of pieces or their respective lengths are large. Usually, however, we're calling hash so the value can be used as a key in a hash table and such keys are typically simple (or at least not ridiculously long), so this is not an issue in normal usage.

newtype Width Source #

The length of the Rope, in characters. This is the monoid used to structure the finger tree underlying the Rope.

Constructors

Width Int 

Instances

Instances details
Eq Width Source # 
Instance details

Defined in Core.Text.Rope

Methods

(==) :: Width -> Width -> Bool #

(/=) :: Width -> Width -> Bool #

Num Width Source # 
Instance details

Defined in Core.Text.Rope

Ord Width Source # 
Instance details

Defined in Core.Text.Rope

Methods

compare :: Width -> Width -> Ordering #

(<) :: Width -> Width -> Bool #

(<=) :: Width -> Width -> Bool #

(>) :: Width -> Width -> Bool #

(>=) :: Width -> Width -> Bool #

max :: Width -> Width -> Width #

min :: Width -> Width -> Width #

Show Width Source # 
Instance details

Defined in Core.Text.Rope

Methods

showsPrec :: Int -> Width -> ShowS #

show :: Width -> String #

showList :: [Width] -> ShowS #

Generic Width Source # 
Instance details

Defined in Core.Text.Rope

Associated Types

type Rep Width :: Type -> Type #

Methods

from :: Width -> Rep Width x #

to :: Rep Width x -> Width #

Semigroup Width Source # 
Instance details

Defined in Core.Text.Rope

Methods

(<>) :: Width -> Width -> Width #

sconcat :: NonEmpty Width -> Width #

stimes :: Integral b => b -> Width -> Width #

Monoid Width Source # 
Instance details

Defined in Core.Text.Rope

Methods

mempty :: Width #

mappend :: Width -> Width -> Width #

mconcat :: [Width] -> Width #

Measured Width ShortText Source # 
Instance details

Defined in Core.Text.Rope

Methods

measure :: ShortText -> Width #

Textual (FingerTree Width ShortText) Source # 
Instance details

Defined in Core.Text.Rope

type Rep Width Source # 
Instance details

Defined in Core.Text.Rope

type Rep Width = D1 ('MetaData "Width" "Core.Text.Rope" "core-text-0.3.8.0-I6LlNWNPeNi7cc8wYGtU2h" 'True) (C1 ('MetaCons "Width" 'PrefixI 'False) (S1 ('MetaSel ('Nothing :: Maybe Symbol) 'NoSourceUnpackedness 'NoSourceStrictness 'DecidedLazy) (Rec0 Int)))

Orphan instances

Measured Width ShortText Source # 
Instance details

Methods

measure :: ShortText -> Width #