alfred-margaret-2.0.0.0: Fast Aho-Corasick string searching
Safe HaskellNone
LanguageHaskell2010

Data.Text.Utf8

Description

This module provides functions that allow treating Text values as series of UTF-8 code units instead of characters. Any calls to Text in alfred-margaret go through this module. Therefore we re-export some Text functions, e.g. concat.

Synopsis

Documentation

type CodePoint = Char Source #

A Unicode code point.

type CodeUnit = Word8 Source #

A UTF-8 code unit is a byte. A Unicode code point can be encoded as up to four code units.

newtype CodeUnitIndex Source #

An index into the raw UTF-8 data of a Text. This is not the code point index as conventionally accepted by Text, so we wrap it to avoid confusing the two. Incorrect index manipulation can lead to surrogate pairs being sliced, so manipulate indices with care. This type is also used for lengths.

Constructors

CodeUnitIndex 

Fields

Instances

Instances details
Bounded CodeUnitIndex Source # 
Instance details

Defined in Data.Text.Utf8

Eq CodeUnitIndex Source # 
Instance details

Defined in Data.Text.Utf8

Num CodeUnitIndex Source # 
Instance details

Defined in Data.Text.Utf8

Ord CodeUnitIndex Source # 
Instance details

Defined in Data.Text.Utf8

Show CodeUnitIndex Source # 
Instance details

Defined in Data.Text.Utf8

Generic CodeUnitIndex Source # 
Instance details

Defined in Data.Text.Utf8

Associated Types

type Rep CodeUnitIndex :: Type -> Type #

Hashable CodeUnitIndex Source # 
Instance details

Defined in Data.Text.Utf8

ToJSON CodeUnitIndex Source # 
Instance details

Defined in Data.Text.Utf8

FromJSON CodeUnitIndex Source # 
Instance details

Defined in Data.Text.Utf8

NFData CodeUnitIndex Source # 
Instance details

Defined in Data.Text.Utf8

Methods

rnf :: CodeUnitIndex -> () #

type Rep CodeUnitIndex Source # 
Instance details

Defined in Data.Text.Utf8

type Rep CodeUnitIndex = D1 ('MetaData "CodeUnitIndex" "Data.Text.Utf8" "alfred-margaret-2.0.0.0-AHkhEWrSlE7G50v2ifxOcH" 'True) (C1 ('MetaCons "CodeUnitIndex" 'PrefixI 'True) (S1 ('MetaSel ('Just "codeUnitIndex") 'NoSourceUnpackedness 'NoSourceStrictness 'DecidedLazy) (Rec0 Int)))

data Text #

A space efficient, packed, unboxed Unicode text type.

Constructors

Text !Array !Int !Int 

Instances

Instances details
Hashable Text 
Instance details

Defined in Data.Hashable.Class

Methods

hashWithSalt :: Int -> Text -> Int #

hash :: Text -> Int #

ToJSON Text 
Instance details

Defined in Data.Aeson.Types.ToJSON

ToJSONKey Text 
Instance details

Defined in Data.Aeson.Types.ToJSON

FromJSON Text 
Instance details

Defined in Data.Aeson.Types.FromJSON

FromJSONKey Text 
Instance details

Defined in Data.Aeson.Types.FromJSON

Chunk Text 
Instance details

Defined in Data.Attoparsec.Internal.Types

Associated Types

type ChunkElem Text #

type State Text 
Instance details

Defined in Data.Attoparsec.Internal.Types

type State Text = Buffer
type ChunkElem Text 
Instance details

Defined in Data.Attoparsec.Internal.Types

type Item Text 
Instance details

Defined in Data.Text

type Item Text = Char

isCaseInvariant :: Text -> Bool Source #

Return whether text is the same lowercase as uppercase, such that this function will not return true when Aho–Corasick would differentiate when doing case-insensitive matching.

lengthUtf8 :: Text -> CodeUnitIndex Source #

The return value of this function is not really an index. However the signature is supposed to make it clear that the length is returned in terms of code units, not code points.

lowerCodePoint :: Char -> Char Source #

Lower-Case a UTF-8 codepoint. Uses toLowerAscii for ASCII and toLower otherwise.

lowerUtf8 :: Text -> Text Source #

Lowercase a Text by applying lowerCodePoint to each Char.

toLowerAscii :: Char -> Char Source #

Lower-case the ASCII code points A-Z and leave the rest of ASCII intact.

unicode2utf8 :: (Ord a, Num a, Bits a) => a -> [a] Source #

Convert a Unicode Code Point c into a list of UTF-8 code units (bytes).

Decoding

Functions that turns code unit sequences into code point sequences.

decode2 :: CodeUnit -> CodeUnit -> CodePoint Source #

Decode 2 UTF-8 code units into their code point. The given code units should have the following format:

┌───────────────┬───────────────┐
│1 1 0 x x x x x│1 0 x x x x x x│
└───────────────┴───────────────┘

decode3 :: CodeUnit -> CodeUnit -> CodeUnit -> CodePoint Source #

Decode 3 UTF-8 code units into their code point. The given code units should have the following format:

┌───────────────┬───────────────┬───────────────┐
│1 1 1 0 x x x x│1 0 x x x x x x│1 0 x x x x x x│
└───────────────┴───────────────┴───────────────┘

decode4 :: CodeUnit -> CodeUnit -> CodeUnit -> CodeUnit -> CodePoint Source #

Decode 4 UTF-8 code units into their code point. The given code units should have the following format:

┌───────────────┬───────────────┬───────────────┬───────────────┐
│1 1 1 1 0 x x x│1 0 x x x x x x│1 0 x x x x x x│1 0 x x x x x x│
└───────────────┴───────────────┴───────────────┴───────────────┘

decodeUtf8 :: [CodeUnit] -> [CodePoint] Source #

Decode a list of UTF-8 code units into a list of code points.

Indexing

Text can be indexed by code units or code points. A CodePoint is a 21-bit Unicode code point and can consist of up to four code units. A CodeUnit is a single byte.

indexCodeUnit :: Text -> CodeUnitIndex -> CodeUnit Source #

Get the code unit at the given CodeUnitIndex. Performs bounds checking.

unsafeIndexCodePoint :: Text -> CodeUnitIndex -> (CodeUnitIndex, CodePoint) Source #

Does exactly the same thing as unsafeIndexCodePoint', but on Text values.

Slicing Functions

unsafeCutUtf8 and unsafeSliceUtf8 are used to retrieve slices of Text values. unsafeSliceUtf8 begin length returns a substring of length length starting at begin. unsafeSliceUtf8 begin length returns a tuple of the "surrounding" substrings.

They satisfy the following property:

let (prefix, suffix) = unsafeCutUtf8 begin length t
in concat [prefix, unsafeSliceUtf8 begin length t, suffix] == t

The following diagram visualizes the relevant offsets for begin = CodeUnitIndex 2, length = CodeUnitIndex 6 and t = "BCDEFGHIJKL".

 off                 off+len
  │                     │
  ▼                     ▼
──┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬──
 A│B│C│D│E│F│G│H│I│J│K│L│M│N
──┴─┴─┴─┴─┴─┴─┴─┴─┴─┴─┴─┴─┴──
      ▲           ▲
      │           │
 off+begin   off+begin+length

unsafeSliceUtf8 begin length t == "DEFGHI"
unsafeCutUtf8 begin length t == ("BC", "JKL")

The shown array is open at each end because in general, t may be a slice as well.

WARNING: As their name implies, these functions are not (necessarily) bounds-checked. Use at your own risk.

unsafeCutUtf8 Source #

Arguments

:: CodeUnitIndex

Starting position of substring.

-> CodeUnitIndex

Length of substring.

-> Text

Initial string.

-> (Text, Text) 

Functions on Arrays

Functions for working with Array values.

unsafeIndexCodePoint' :: Array -> CodeUnitIndex -> (CodeUnitIndex, CodePoint) Source #

Decode a code point at the given CodeUnitIndex. Returns garbage if there is no valid code point at that position. Does not perform bounds checking. See decode2, decode3 and decode4 for the expected format of multi-byte code points.

General Functions

Re-exported from Text.

concat :: [Text] -> Text #

O(n) Concatenate a list of Texts.

dropWhile :: (Char -> Bool) -> Text -> Text #

O(n) dropWhile p t returns the suffix remaining after takeWhile p t.

isInfixOf :: Text -> Text -> Bool #

O(n+m) The isInfixOf function takes two Texts and returns True iff the first is contained, wholly and intact, anywhere within the second.

In (unlikely) bad cases, this function's time complexity degrades towards O(n*m).

null :: Text -> Bool #

O(1) Tests whether a Text is empty or not.

pack :: String -> Text #

O(n) Convert a String into a Text. Performs replacement on invalid scalar values.

replicate :: Int -> Text -> Text #

O(n*m) replicate n t is a Text consisting of the input t repeated n times.

unpack :: Text -> String #

O(n) Convert a Text into a String.

indices #

Arguments

:: Text

Substring to search for (needle)

-> Text

Text to search in (haystack)

-> [Int] 

O(n+m) Find the offsets of all non-overlapping indices of needle within haystack.

In (unlikely) bad cases, this algorithm's complexity degrades towards O(n*m).