Safe Haskell | None |
---|---|
Language | Haskell2010 |
If you're accustomed to working with text in almost any other programming
language, you'd be aware that a "string" typically refers to an in-memory
array of characters. Traditionally this was a single ASCII byte per
character; more recently UTF-8 variable byte encodings which dramatically
complicates finding offsets but which gives efficient support for the entire
Unicode character space. In Haskell, the original text type, String
, is
implemented as a list of Char
which, because a Haskell list is implemented
as a linked-list of boxed values, is wildly inefficient at any kind of
scale.
In modern Haskell there are two primary ways to represent text.
First is via the [rather poorly named] ByteString
from the bytestring
package (which is an array of bytes in pinned memory). The
Data.ByteString.Char8 submodule gives you ways to manipulate those arrays as
if they were ASCII characters. Confusingly there are both strict
(Data.ByteString
) and lazy (Data.ByteString.Lazy
) variants which are often
hard to tell the difference between when reading function signatures or
haddock documentation. The performance problem an immutable array backed data
type runs into is that appending a character (that is, ASCII byte) or
concatonating a string (that is, another array of ASCII bytes) is very
expensive and requires allocating a new larger array and copying the whole
thing into it. This led to the development of "builders" which amortize this
reallocation cost over time, but it can be cumbersome to switch between
Builder
, the lazy ByteString
that results, and then having to inevitably
convert to a strict ByteString
because that's what the next function in your
sequence requires.
The second way is through the opaque Text
type of Data.Text from the
text package, which is well tuned and high-performing but suffers from the
same design; it is likewise backed by arrays. (Historically, the storage
backing Text objects was encoded in UTF-16, meaning every time you wanted to
work with unicode characters that came in from anywhere else and which
inevitably were UTF-8 encoded they had to be converted to UTF-16 and copied
into a further new array! Fortunately Haskell has recently adopted a UTF-8
backed Text
type, reducing this overhead. The challenge of appending pinned
allocations remains, however.)
In this package we introduce Rope
, a text type backed by the 2-3
FingerTree
data structure from the fingertree package.
This is not an uncommon solution in many languages as finger trees support
exceptionally efficient appending to either end and good performance inserting
anywhere else (you often find them as the backing data type underneath text
editors for this reason). Rather than Char
the pieces of the rope are
ShortText
from the text-short package, which are UTF-8
encoded and in normal memory managed by the Haskell runtime. Conversion from
other Haskell text types is not O(1) (UTF-8 validity must be checked, or
UTF-16 decoded, or...), but in our benchmarking the performance has been
comparable to the established types and you may find the resultant interface
for combining chunks is comparable to using a Builder, without being forced to
use a Builder.
Rope
is used as the text type throughout this library. If you use the
functions within this package (rather than converting to other text types)
operations are quite efficient. When you do need to convert to another type
you can use fromRope
or intoRope
from the Textual
typeclass.
Note that we haven't tried to cover the entire gamut of operations or
customary convenience functions you would find in the other libraries; so far
Rope
is concentrated on aiding interoperation, being good at appending (lots
of) small pieces, and then efficiently taking the resultant text object out to
a file handle, be that the terminal console, a file, or a network socket.
Synopsis
- data Rope
- emptyRope :: Rope
- singletonRope :: Char -> Rope
- packRope :: String -> Rope
- replicateRope :: Int -> Rope -> Rope
- replicateChar :: Int -> Char -> Rope
- widthRope :: Rope -> Int
- unconsRope :: Rope -> Maybe (Char, Rope)
- splitRope :: Int -> Rope -> (Rope, Rope)
- takeRope :: Int -> Rope -> Rope
- insertRope :: Int -> Rope -> Rope -> Rope
- containsCharacter :: Char -> Rope -> Bool
- findIndexRope :: (Char -> Bool) -> Rope -> Maybe Int
- class Textual α where
- hWrite :: Handle -> Rope -> IO ()
- unRope :: Rope -> FingerTree Width ShortText
- nullRope :: Rope -> Bool
- unsafeIntoRope :: ByteString -> Rope
- copyRope :: Rope -> Rope
- newtype Width = Width Int
Rope type
A type for textual data. A rope is text backed by a tree data structure, rather than a single large continguous array, as is the case for strings.
There are three use cases:
Referencing externally sourced data
Often we interpret large blocks of data sourced from external systems as text.
Ideally we would hold onto this without copying the memory, but (as in the
case of ByteString
which is the most common source of data) before we can
treat it as text we have to validate the UTF-8 content. Safety first. We also
copy it out of pinned memory, allowing the Haskell runtime to manage the
storage.
Interoperating with other libraries
The only constant of the Haskell universe is that you won't have the right
combination of {strict, lazy} × {Text
, ByteString
, String
, [Word8]
,
etc} you need for the next function call. The Textual
typeclass provides for
moving between different text representations. To convert between Rope
and
something else use fromRope
; to construct a Rope
from textual content in
another type use intoRope
.
You can get at the underlying finger tree with the unRope
function.
Assembling text to go out
This involves considerable appending of data, very very occaisionally
inserting it. Often the pieces are tiny. To add text to a Rope
use the
appendRope
method as below or the (<>
) operator from
Data.Monoid (like you would have with a Builder
).
Output to a Handle
can be done efficiently with hWrite
.
Instances
Eq Rope Source # | |
Ord Rope Source # | |
Show Rope Source # | |
IsString Rope Source # | |
Defined in Core.Text.Rope fromString :: String -> Rope # | |
Generic Rope Source # | |
Semigroup Rope Source # | |
Monoid Rope Source # | |
NFData Rope Source # | |
Defined in Core.Text.Rope | |
Hashable Rope Source # | |
Defined in Core.Text.Rope | |
Pretty Rope Source # | |
Defined in Core.Text.Rope | |
Binary Rope Source # | |
Textual Rope Source # | |
Render Rope Source # | |
type Rep Rope Source # | |
Defined in Core.Text.Rope | |
type Token Rope Source # | |
Defined in Core.Text.Utilities |
A zero-length Rope
. You can also use ""
presuming the
OverloadedStrings
language extension is turned on in your source file.
replicateRope :: Int -> Rope -> Rope Source #
Repeat the input Rope
n
times. The follows the same semantics as other
replicate
functions; if you ask for zero copies you'll get an empty text and
if you ask for lots of ""
you'll get ... an empty text.
Implementation note
Rather than copying the input n times, this will simply add structure to hold n references to the provided input text.
replicateChar :: Int -> Char -> Rope Source #
Repeat the input Char
n
times. This is a special case of replicateRope
above.
Implementation note
Rather than making a huge FingerTree full of single characters, this function will allocate a single ShortText comprised of the repeated input character.
splitRope :: Int -> Rope -> (Rope, Rope) Source #
Break the text into two pieces at the specified offset.
Examples:
λ> splitRope 0 "abcdef" ("", "abcdef") λ> splitRope 3 "abcdef" ("abc", "def") λ> splitRope 6 "abcdef" ("abcdef","")
Going off either end behaves sensibly:
λ> splitRope 7 "abcdef" ("abcdef","") λ> splitRope (-1) "abcdef" ("", "abcdef")
takeRope :: Int -> Rope -> Rope Source #
Take the first _n_ characters from the beginning of the Rope.
λ> takeRope 3 "123456789" "123"
insertRope :: Int -> Rope -> Rope -> Rope Source #
Insert a new piece of text into an existing Rope
at the specified offset.
Examples:
λ> insertRope 3 "Con" "Def 1" "DefCon 1" λ> insertRope 0 "United " "Nations" "United Nations"
containsCharacter :: Char -> Rope -> Bool Source #
Does the text contain this character?
We've used it to ask whether there are newlines present in a Rope
, for
example:
if containsCharacter
'\n' text
then handleComplexCase
else keepItSimple
Interoperation and Output
class Textual α where Source #
Machinery to interpret a type as containing valid Unicode that can be
represented as a Rope
object.
Implementation notes
Given that Rope
is backed by a finger tree, appendRope
is relatively
inexpensive, plus whatever the cost of conversion is. There is a subtle trap,
however: if adding small fragments of that were obtained by slicing (for
example) a large ByteString
we would end up holding on to a
reference to the entire underlying block of memory. This module is optimized
to reduce heap fragmentation by letting the Haskell runtime and garbage
collector manage the memory, so instances are expected to copy these
substrings out of pinned memory.
The ByteString
instance requires that its content be valid
UTF-8. If not an empty Rope
will be returned.
Several of the fromRope
implementations are expensive and involve a lot of
intermediate allocation and copying. If you're ultimately writing to a handle
prefer hWrite
which will write directly to the output buffer.
fromRope :: Rope -> α Source #
Convert a Rope
into another text-like type.
intoRope :: α -> Rope Source #
Take another text-like type and convert it to a Rope
.
appendRope :: α -> Rope -> Rope Source #
Instances
Textual ByteString Source # | from Data.ByteString.Lazy |
Defined in Core.Text.Rope fromRope :: Rope -> ByteString Source # intoRope :: ByteString -> Rope Source # appendRope :: ByteString -> Rope -> Rope Source # | |
Textual ByteString Source # | from Data.ByteString Strict |
Defined in Core.Text.Rope fromRope :: Rope -> ByteString Source # intoRope :: ByteString -> Rope Source # appendRope :: ByteString -> Rope -> Rope Source # | |
Textual Text Source # | from Data.Text.Lazy |
Textual Text Source # | from Data.Text Strict |
Textual ShortText Source # | from Data.Text.Short |
Textual Bytes Source # | |
Textual Rope Source # | |
Textual [Char] Source # | from Data.String |
Textual (FingerTree Width ShortText) Source # | |
Defined in Core.Text.Rope |
hWrite :: Handle -> Rope -> IO () Source #
Write the Rope
to the given Handle
.
import Core.Text import Core.System -- re-exports stdout main :: IO () main = let text ::Rope
text = "Hello World" inhWrite
stdout
text
because it's tradition.
Uses hPutBuilder
internally which saves all kinds of
intermediate allocation and copying because we can go from the
ShortText
s in the finger tree to
ShortByteString
to Builder
to the Handle'
s output buffer in one go.
If you're working in the
<https://hackage.haskell.org/package/core-program/docs/Core-Program-Execute.html#t:Program
Program> monad, then
<https://hackage.haskell.org/package/core-program/docs/Core-Program-Logging.html#v:write
write> provides an efficient way to write a Rope
to stdout
.
Internals
unRope :: Rope -> FingerTree Width ShortText Source #
Access the finger tree underlying the Rope
. You'll want the following
imports:
import qualified Data.FingerTree as F -- from the fingertree package import qualified Data.Text.Short as S -- from the text-short package
unsafeIntoRope :: ByteString -> Rope Source #
If you know the input bytes are valid UTF-8 encoded characters, then you can
use this function to convert to a piece of Rope
.
copyRope :: Rope -> Rope Source #
Copy the pieces underlying a Rope
into a single piece object.
Warning
This function was necessary to have a reliable Hashable
instance. Currently
constructing this new Rope
is quite inefficient if the number of pieces or
their respective lengths are large. Usually, however, we're calling
hash
so the value can be used as a key in a hash table and
such keys are typically simple (or at least not ridiculously long), so this is
not an issue in normal usage.
The length of the Rope
, in characters. This is the monoid used to
structure the finger tree underlying the Rope
.
Instances
Eq Width Source # | |
Num Width Source # | |
Ord Width Source # | |
Show Width Source # | |
Generic Width Source # | |
Semigroup Width Source # | |
Monoid Width Source # | |
Measured Width ShortText Source # | |
Defined in Core.Text.Rope | |
Textual (FingerTree Width ShortText) Source # | |
Defined in Core.Text.Rope | |
type Rep Width Source # | |
Defined in Core.Text.Rope |