| Safe Haskell | Safe-Inferred |
|---|---|
| Language | GHC2021 |
Data.Series.Tutorial
Synopsis
Introduction
This is a short user guide on how to get started using javelin and its various modules.
The central data structure at the heart of this package is the Series. A
is a labeled array of type Series k av filled with values of type a, indexed by keys of type k.
Like Map, Series support efficient:
- random access by key ( \(O(\log n)\) );
- slice by key ( \(O(\log n)\) ).
Like Vector, Series support efficient:
- numerical operations.
- random access by index ( \(O(1)\) );
- slice by index ( \(O(1)\) );
To follow along this tutorial, the following imports are expected:
>>>import Data.Series as Series
Construction
The easiest way to create a Series is to do it from a list using fromList:
>>>Series.fromList [ ('a', 1::Int), ('b', 2), ('c', 3), ('d', 4) ]index | values ----- | ------ 'a' | 1 'b' | 2 'c' | 3 'd' | 4
Note what happens when we have the same key ('a') attached to multiple values:
>>>Series.fromList [ ('a', 1::Int), ('a', 0), ('b', 2), ('c', 3), ('d', 4) ]index | values ----- | ------ 'a' | 0 'b' | 2 'c' | 3 'd' | 4
Series, like Maps, have unique keys; therefore, the output series may
not be the same length as the input series. See further below for an
explanation of how to handle duplicate keys.
Since Series are like Map, it's easy to convert between the two:
>>>let mp = Data.Map.Strict.fromList [ ('a', 0::Int), ('a', 1), ('b', 2), ('c', 3), ('d', 4) ]>>>mpfromList [('a',1),('b',2),('c',3),('d',4)]>>>Series.fromStrictMap mpindex | values ----- | ------ 'a' | 1 'b' | 2 'c' | 3 'd' | 4
Of course, fromLazyMap is also available. In fact, conversion to/from Series is supported for
many types; see the IsSeries typeclass and its methods, toSeries and fromSeries.
Index
Series have two components: values and an index.
The index (of type ) is an ordered set of unique elements which allows to determine
where are each values in the series. Since all keys in an Index kIndex are unique and sorted, it
is fast to find the value associated to any random key.
As we'll see soon, Index is an important data structure which can be used to slice through a Series,
so let's get comfortable with them.
>>>import qualified Data.Series.Index as Index
An Index can be constructed from a list:
>>>Index.fromList [5::Int,5,4,3,2,1,5,5,5]Index [1,2,3,4,5]
As you see above, repeated elements (in this case, 5) won't be repeated in the Index. Therefore, it often makes
more sense to construct an Index using fromSet from a Set from Data.Set.
One common way to construct an Index is to programmatically unfold a seed value using
unfoldr. Below, we want to generate numbers from 7 down to 1:
>>>Index.unfoldr (\x -> if x < 1 then Nothing else Just (x, x-1)) (7 :: Int)Index [1,2,3,4,5,6,7]
This task is so common that there is a convenience function to create ranges, range.
For example, if you want to create an Index of values starting at 1 and ending at 10, in
steps of 3:
>>>Index.range (+3) (1 :: Int) 10Index [1,4,7,10]
An Index is very much like a Set, so you can
- check for membership using
member; - combine two
Indexusingunion,intersection, anddifference; - find the integer index of a key using
lookupIndex;
and more.
Selections
Single-key selection
Single-element selections are performed using at, which selects a single element by key. at is safe;
if the key is missing, Nothing is returned:
>>>let xs = Series.fromList [ ('a', 1::Int), ('b', 2), ('c', 3), ('d', 4) ]>>>xsindex | values ----- | ------ 'a' | 1 'b' | 2 'c' | 3 'd' | 4>>>xs `at` 'a'Just 1>>>xs `at` 'z'Nothing
Bulk selections
Bulk selection, also known as *slicing*, is the method by which we extract a sub-series from a series.
In the examples below, we'll assume that we have the series aapl_close is available in-scope, which represents
the closing price of Apple stock:
>>>:{let aapl_close = Series.fromList [ ("2010-01-04", 6.5522 :: Double) , ("2010-01-05", 6.5636) , ("2010-01-06", 6.4592) , ("2010-01-07", 6.4472) , ("2010-01-08", 6.4901) -- No prices during the weekend , ("2010-01-11", 6.5152) , ("2010-01-12", 6.4047) , ("2010-01-13", 6.3642) , ("2010-01-14", 6.4328) , ("2010-01-15", 6.4579) ] :}
Bulk selection is done via the select function. select works with many types of inputs.
For example, we can query for a contiguous range of keys by using to:
>>>aapl_close `select` "2010-01-04" `to` "2010-01-08"index | values ----- | ------ "2010-01-04" | 6.5522 "2010-01-05" | 6.5636 "2010-01-06" | 6.4592 "2010-01-07" | 6.4472 "2010-01-08" | 6.4901
You can also request unbounded ranges. For example all dates up to "2010-01-08" using upto:
>>>aapl_close `select` upto "2010-01-08"index | values ----- | ------ "2010-01-04" | 6.5522 "2010-01-05" | 6.5636 "2010-01-06" | 6.4592 "2010-01-07" | 6.4472 "2010-01-08" | 6.4901
There's also the other unbound range, from:
>>>aapl_close `select` from "2010-01-11"index | values ----- | ------ "2010-01-11" | 6.5152 "2010-01-12" | 6.4047 "2010-01-13" | 6.3642 "2010-01-14" | 6.4328 "2010-01-15" | 6.4579
Note that the bounds may contain less data than you think! For example, let's look at a 5-day range:
>>>aapl_close `select` "2010-01-08" `to` "2010-01-12"index | values ----- | ------ "2010-01-08" | 6.4901 "2010-01-11" | 6.5152 "2010-01-12" | 6.4047
We've requested a range of 5 days ("2010-01-08", "2010-01-09", "2010-01-10", "2010-01-11", "2010-01-12"),
but there's no data in our series with the keys "2010-01-09" and "2010-01-10", because it was the week-end
(stock markets are usually closed on week-ends).
Sometimes you want to be more specific than a contiguous range of data; select
also supports bulk *random* access like so:
>>>aapl_close `select` ["2010-01-08", "2010-01-10", "2010-01-12"]index | values ----- | ------ "2010-01-08" | 6.4901 "2010-01-12" | 6.4047
Note above that we've requested data for the date "2010-01-10", but it's missing. Therefore,
the data isn't returned. If you want to get a sub-series which has the exact index that
you've asked for, you can use require in combination with an Index:
>>>import qualified Data.Series.Index as Index>>>aapl_close `require` Index.fromList ["2010-01-08", "2010-01-10", "2010-01-12"]index | values ----- | ------ "2010-01-08" | Just 6.4901 "2010-01-10" | Nothing "2010-01-12" | Just 6.4047
Using require or select in conjunction with range is very powerful.
Filtering and mapping
Series support operations on both their index and their values. To illustrate
this, let's load some latitude and longitude data for some cities.
We'll assume that the following types are in scope:
>>>import Data.Fixed (Centi)>>>data Position = Pos { latitude :: Centi, longitude :: Centi } deriving (Show)>>>:{let cities = Series.fromList [ ("Paris"::String , Pos 48.86 2.35) , ("New York City" , Pos 40.71 (-74.01)) , ("Taipei" , Pos 25.04 121.56) , ("Buenos Aires" , Pos (-34.60) (-58.38)) ] :}
We can easily filter for data just like you would filter a list.
In this example, let's find cities in the western hemisphere (i.e. cities
which have negative longitudes), using filter:
>>>Series.filter (\pos -> longitude pos < 0) citiesindex | values ----- | ------ "Buenos Aires" | Pos {latitude = -34.60, longitude = -58.38} "New York City" | Pos {latitude = 40.71, longitude = -74.01}
We can transform the values of a Series using map. In this example,
let's isolate the latitude of cities in the western hemisphere:
>>>let western_cities = Series.filter (\pos -> longitude pos < 0) cities>>>Series.map latitude western_citiesindex | values ----- | ------ "Buenos Aires" | -34.60 "New York City" | 40.71
Finally, we can summarize the Series by reducing all its values.
Let's average the latitude of cities in the western hemisphere:
>>>import Data.Series ( mean )>>>let latitudes = Series.map latitude western_cities>>>Series.fold mean latitudes3.05
The next section introduces fold more generally.
Folding
Folding refers to the action of aggregating values in a Series to a single value.
Folding Series is done through the fold function. Its type signature is:
>>>:t Series.foldSeries.fold :: Fold a b -> Series k a -> b
Here, represents a calculation which takes in values of type Fold a ba, and will ultimately produce a
final value of type b. Such calculations are provided by the foldl package (see Foldl), although
some of its functions are re-exported by Data.Series (and Data.Series.Unboxed), such as mean.
Let's look at an example. First, we'll need some data. We'll use end-of-day stock prices for Apple Inc:
>>>import Data.Fixed ( Centi )>>>(aapl_closing :: Series String Double) <- (Series.fromList . read) <$> readFile "files/aapl.txt">>>aapl_closingindex | values ----- | ------ "1980-12-12" | 0.1007 "1980-12-15" | 9.54e-2 "1980-12-16" | 8.84e-2 ... | ... "2022-01-05" | 174.92 "2022-01-06" | 172.0 "2022-01-07" | 172.17
Normally we would use an appropriate datetime type for the index of aapl_closing,
for example from the time package, but we're keeping it simple for this tutorial.
Prices have changed a lot over the years, so we'll restrict ourselves to 2021:
>>>let aapl_closing_2021 = aapl_closing `select` "2021-01-01" `to` "2021-12-31">>>aapl_closing_2021index | values ----- | ------ "2021-01-04" | 128.6174 "2021-01-05" | 130.2076 "2021-01-06" | 125.8246 ... | ... "2021-12-29" | 179.38 "2021-12-30" | 178.2 "2021-12-31" | 177.57
To calculate the average closing price over the year 2021, we use fold in conjunction with
mean:
>>>Series.fold Series.mean aapl_closing_2021140.61256349206354
One of the magic things about Fold is that it's possible to combine them in such a way that you can
traverse a Series only once, which is important for good performance. As an example, we'll calculate
both the mean closing price AND the standard deviation of closing prices.
>>>let meanAndStdDev = (,) <$> Data.Series.mean <*> Data.Series.std>>>Series.fold meanAndStdDev aapl_closing_2021(140.61256349206354,14.811663837435361)
See Foldl from the foldl package for more information on Fold.
Grouping
One important feature of Series is the ability to efficiently group values
together based on their keys.
Let's load some stock price data again for this part:
>>>import Data.Fixed ( Centi )>>>(aapl_closing :: Series String Double) <- (Series.fromList . read) <$> readFile "files/aapl.txt">>>aapl_closingindex | values ----- | ------ "1980-12-12" | 0.1007 "1980-12-15" | 9.54e-2 "1980-12-16" | 8.84e-2 ... | ... "2022-01-05" | 174.92 "2022-01-06" | 172.0 "2022-01-07" | 172.17
Grouping involves two steps:
- Grouping keys in some way using
groupBy; - Aggregating the values in each group using
aggregateWithor other variants.
Let's find the highest closing price of each month. First, we need to define our grouping function:
>>>:{-- | Extract the year and month from a date like XXXX-YY-ZZ. For example: -- -- >>> month "2023-01-01" -- "2023-01" month :: String -> String month = take 7 :}
Then, we can group keys by month and take the maximum of each group:
>>>aapl_closing `groupBy` month `aggregateWith` maximumindex | values ----- | ------ "1980-12" | 0.1261 "1981-01" | 0.1208 "1981-02" | 0.1007 ... | ... "2021-11" | 165.3 "2021-12" | 180.33 "2022-01" | 182.01
This means, for example, that the maximum closing price for Apple stock in the
month of November 2021 was $165.30 per share. This library also contains
numerical aggregation functions such as mean and std. Therefore, in order
to find the monthly average Apple closing price, rounded to the nearest cent:
>>>import Data.Series (mean)>>>let (roundToCent :: Double -> Double) = \x -> fromIntegral ((round $ x * 100) :: Int) / 100>>>aapl_closing `groupBy` month `aggregateWith` (roundToCent . Series.fold mean)index | values ----- | ------ "1980-12" | 0.11 "1981-01" | 0.11 "1981-02" | 9.0e-2 ... | ... "2021-11" | 154.21 "2021-12" | 173.55 "2022-01" | 176.16
Window aggregation
Windowing aggregation refers to the practice of aggregating values in a window around every key.
General-purpose windowing is done using the windowing function. Let's look at its
type signature:
>>>:t windowingwindowing :: Ord k => (k -> Range k) -> (Series k a -> b) -> Series k a -> Series k b
Here, is a new series windowing window aggfunc xs where
for every key Series k bk, the values in the range window k are aggregated by aggfunc
and placed in the resulting series at key k. Here's an example where
for every key k, we add the values at k and k+1:
>>>:{let (xs :: Series Int Int) = Series.fromList [ (1, 0) , (2, 1) , (3, 2) , (4, 3) , (5, 4) , (6, 5) ] in windowing (\k -> k `to` (k + 1)) sum xs :} index | values ----- | ------ 1 | 1 2 | 3 3 | 5 4 | 7 5 | 9 6 | 5
windowing can be used to compute so-called rolling aggregations. An example of
this is to compute the rolling mean of the last 3 keys:
>>>import Data.Series ( mean )>>>:{let rollingMean = windowing (\k -> (k-3) `to` k) (Series.fold mean) (xs :: Series Int Double) = Series.fromList [ (1, 0) , (2, 1) , (3, 2) , (4, 3) , (5, 4) , (6, 5) ] in (rollingMean xs) :: Series Int Double :} index | values ----- | ------ 1 | 0.0 2 | 0.5 3 | 1.0 4 | 1.5 5 | 2.5 6 | 3.5
Combining Series together
An important class of operations are combining two Series together, also known as *zipping*.
For lists, Haskell has zipWith. Series also have zipWith and variants:
zipWith, which combines two series with some elementwise function;zipWithMatched, which combines two series with some elementwise function on keys which are in *both* maps;zipWithStrategy, which combines two series with some elementwise function and supports custom operations to deal with missing keys;
To illustrate the differences between the various zipping functions, consider the following two series. There's population:
>>>:set -XNumericUnderscores>>>import Data.Fixed (Centi)>>>:{-- Most recent population estimate rounded to the nearest million let population = Series.fromList [ ("Canada"::String, 40_000_000::Centi) , ("Kenya" , 56_000_000) , ("Poland" , 38_000_000) , ("Singapore" , 6_000_000) ] :}
and there's total land mass:
>>>:{-- Land mass in square kilometer let landmass = Series.fromList [ ("Brazil"::String, 8_520_000::Centi) , ("Canada", 9_990_000) , ("Kenya", 580_000) , ("Poland", 313_000) ] :}
combines the series zipWith f left rightleft and right using the
function f which admits two arguments, for all keys one-by-one. If a key
is missing from either left or right, zipWith returns Nothing. For example,
the population density per country would be:
>>>Series.zipWith (/) population landmassindex | values ----- | ------ "Brazil" | Nothing "Canada" | Just 4.00 "Kenya" | Just 96.55 "Poland" | Just 121.40 "Singapore" | Nothing
Since we don't have population estimates for Brazil and no land mass information for Singapore, we can't calculate their population densities.
Sometimes, we only care about the results of where keys are
in both series. In this case, we can use zipWith fzipWithMatched:
>>>Series.zipWithMatched (/) population landmassindex | values ----- | ------ "Canada" | 4.00 "Kenya" | 96.55 "Poland" | 121.40
Finally, in case we want full control over what to do when a key is missing, we can use @Series.zipWithStrategy'. For example, consider the case where:
- If population numbers are missing, I want to set the density to 0;
- If land mass information is missing, I wait to skip calculating the density of this country.
>>>import Data.Series (skipStrategy, constStrategy)>>>let noPopulationStrategy = Series.constStrategy 0>>>let noLandmassStrategy = Series.skipStrategy>>>Series.zipWithStrategy (/) noPopulationStrategy noLandmassStrategy population landmassindex | values ----- | ------ "Canada" | 4.00 "Kenya" | 96.55 "Poland" | 121.40 "Singapore" | 0.00
As you can imagine, zipWithStrategy is the most general and gives the most control, but is less easy
to use than zipWith and zipWithMatched.
Conclusion
This section concludes the introductory tutorial to the javelin package and its Data.Series module.
For a more in-depth look at this package, you can read the full documentation for each module:
Advanced topics
Handling duplicate keys
If you must build a Series with duplicate keys, you can use the fromListDuplicates or
fromVectorDuplicates functions.
In the example below, the key 'd' is repeated three times:
>>>Series.fromListDuplicates [('b', 0::Int), ('a', 5), ('d', 1), ('d', -4), ('d', 7) ]index | values ----- | ------ ('a',0) | 5 ('b',0) | 0 ('d',0) | 1 ('d',1) | -4 ('d',2) | 7
Note that the Series produced by fromListDuplicates still has unique keys, but each key is a
composite of a character and an occurrence. This is reflected in the type:
>>>:t Series.fromListDuplicates [('b', 0::Int), ('a', 5), ('d', 1), ('d', -4), ('d', 7) ]Series.fromListDuplicates [('b', 0::Int), ('a', 5), ('d', 1), ('d', -4), ('d', 7) ] :: Series (Char, Occurrence) Int
Here, Occurrence is a non-negative number, and can be converted to
other integer-like numbers using fromIntegral. In practice, you should aim to aggregate your Series to remove duplicate keys, for example
using groupBy and grouping on the first element of the key (fst):
>>>let xs = Series.fromListDuplicates [('b', 0::Int), ('a', 5), ('d', 1), ('d', -4), ('d', 7) ]>>>xs `groupBy` fst `aggregateWith` sumindex | values ----- | ------ 'a' | 5 'b' | 0 'd' | 4
Unboxed and generic series
The Series defined in Data.Series are based on Vector from Data.Vector.
This implementation is nice because such Series can hold _any_ Haskell type. However, because
Haskell types can be arbitrarily complex, numerical operations on Series may not be as fast
as could be.
For simpler types such as Double and Int, a different kind of series can be used to
speed up numerical calculations: Series from the Data.Series.Unboxed module.
Such Series are much more limited: they can only contain datatypes which are
instances of Unbox.
This then brings the question: how can you write software which supports both ordinary Series
and unboxed Series? The answer is to use functions from the Data.Series.Generic.
For example, we could implement the dot product of two series as:
>>>import qualified Data.Series.Generic as G>>>import Data.Vector.Generic ( Vector )>>>:{dot :: (Ord k, Num a, Vector v a) => G.Series v k a -> G.Series v k a -> a dot v1 v2 = G.sum $ G.zipWithMatched (*) v1 v2 :}
You can convert between the two types of series using the convert function.
Replacing values
map allows to map every value of a series. How about replacing *some*
values in a series? The function replace (and its infix variant, |->) replaces values in the right operand
which have an analogue in the left operand:
>>>import Data.Series ( (|->) )>>>let nan = (0/0) :: Double>>>let right = Series.fromList [('a', 1), ('b', nan), ('c', 3), ('d', nan)]>>>rightindex | values ----- | ------ 'a' | 1.0 'b' | NaN 'c' | 3.0 'd' | NaN>>>let left = Series.fromList [('b', 0::Double), ('d', 0), ('e', 0)]>>>leftindex | values ----- | ------ 'b' | 0.0 'd' | 0.0 'e' | 0.0>>>left |-> rightindex | values ----- | ------ 'a' | 1.0 'b' | 0.0 'c' | 3.0 'd' | 0.0
In the example above, the key 'e' is ignored since it was not in the right
series to begin with.
The flipped version, <-|, is also available.
Comparison with other data structures
Below is a table showing which operations on Data.Series have analogues for other data structures.
| Action | Data.Series | Data.Map.Strict | Data.List | Data.Vector |
|---|---|---|---|---|
| Mapping values | map | map | map | map |
| Mapping index | mapIndex | mapKeys | ||
| Mapping values with key | mapWithKey | mapWithKey | ||
| Filtering values | filter | filter | filter | filter |
| Filtering index | select,
filterWithKey | filterWithKey
| ||
| Indexing by key | at | lookup | ||
| Indexing by position | iat | ! | ! | |
| Combine two structures key-wise | zipWith | merge | ||
| Union | <> | union | union | |
| Group keys | groupBy |