Copyright	(c) Nikos Karagiannidis 2018
License	BSD3
Maintainer	nkarag@gmail.com
Stability	stable
Portability	POSIX
Safe Haskell	None
Language	Haskell2010

Etl.Julius

Contents

The Julius EDSL Syntax
Julius Expression Evaluation
Various ETL Operations

Description

Julius is a type-level Embedded Domain Specific Language (EDSL) for ETL/ELT data processing in Haskell. Julius enables us to express complex data transformation flows (i.e., an arbitrary combination of ETL operations) in a more friendly manner (a Julius Expression), with plain Haskell code (no special language for ETL scripting required). For more information read this Julius Tutorial.

When to use this module

This module should be used whenever one has "tabular data" (e.g., some CSV files, or any type of data that can be an instance of the RTabular type class and thus define the toRTable and fromRTable functions) and wants to analyze them in-memory with the well-known relational algebra operations (selection, projection, join, groupby, aggregations etc) that lie behind SQL. This data analysis takes place within your haskell code, without the need to import the data into a database (database-less data processing) and the result can be turned into the original format (e.g., CSV) with a simple call to the fromRTable function.

Etl.Julius provides a simple language for expressing all relational algebra operations and arbitrary combinations of them, and thus is a powerful tool for expressing complex data transfromations in Haskell. Moreover, the Julius language includes a clause for the Column Mapping (RColMapping) concept, which is a construct used in ETL tools and enables arbitrary transformations at the column level and the creation of derived columns based on arbitrary expressions on the existing ones. Finally, the ad hoc combination of relational operations and Column Mappings, chained in an data transformation flow, implements the concept of the ETL Mapping (ETLMapping), which is the core data mapping unit in all ETL tools and embeds all the "ETL-logic" for loading/creating a single target RTable from a set of source RTables. It is implemented in the ETL.Internal.Core module. For the relational algebra operations, Julius exploits the functions in the RTable.Core module, which also exports it.

The Julius EDSL is the recommended method for expressing ETL flows in Haskell, as well as doing any data analysis task within the DBFunctor package. Etl.Julius is a self-sufficient module and imports all neccesary functionality from RTable.Core and Etl.Internal.Core modules, so a programmer should only import Etl.Julius and nothing else, in order to have complete functionality.

Overview

The core data type in the Julius EDSL is the ETLMappingExpr. This data type creates a so-called Julius Expression. This Julius expression is the "Haskell equivalent" to the ETL Mapping concept discussed above. It is evaluated to an ETLMapping (see ETLMapping), which is our data structure for the internal representation of the ETL Mapping, with the evalJulius function and from then, evaluated into an RTable (see juliusToRTable), which is the final result of our transformation.

A Julius Expression is a chain of ETL Operation Expressions (EtlOpExpr) connected with the :-> constructor (or with the :=> constructor for named result operations - see below for an explanation) This chain of ETL Operations always starts with the EtlMapStart constructor and is executed from left-to-right, or from top-to-bottom:

EtlMapStart :-> <ETLOpExpr> :-> <ETLOpExpr> :-> ... :-> <ETLOpExpr>
-- equivalently
EtlMapStart :-> <ETLOpExpr> 
            :-> <ETLOpExpr> 
            :-> ... 
            :-> <ETLOpExpr>

A Named ETL Operation Expression (NamedMap) is just an ETL Operation with a name, so as to be able to reference this specific step in the chain of ETL Operations. It is actually a named intermediate result, which can reference and use in other parts of our Julius expression. It is similar in notion to a subquery, known as an INLINE VIEW, or better, it is equivalent to the WITH clause in SQL (i.e., also called subquery factoring in SQL parlance) For example:

EtlMapStart :-> <ETLOpExpr> 
            :=> NamedResult "my_intermdt_result" <ETLOpExpr> 
            :-> ... 
            :-> <ETLOpExpr>

An ETL Operation Expression (ETLOpExpr) - a.k.a. a Julius Expression - is either a Column Mapping Expression (ColMappingExpr), or a Relational Operation Expression (ROpExpr). The former is used in order to express a Column Mapping (i.e., an arbitrary transformation at the column level, with which we can create any derived column based on existing columns, see RColMapping) and the latter is Relational Operation (Selection, Projection, Join, Outer Join, Group By, Order By, Aggregate, or a generic Unary or Binary RTable operation, see ROperation)

Typical Structure of an ETL Program using Etl.Julius


import     Etl.Julius
import     RTable.Data.CSV     (CSV, readCSV, writeCSV, toRTable)

-- 1. Define table metadata
-- E.g.,
src_DBTab_MData :: RTableMData
src_DBTab_MData = 
    createRTableMData   (   "sourceTab"  -- table name
                            ,[  ("OWNER", Varchar)                                      -- Owner of the table
                                ,("TABLE_NAME", Varchar)                                -- Name of the table
                                ,("TABLESPACE_NAME", Varchar)                           -- Tablespace name
                                ,("STATUS",Varchar)                                     -- Status of the table object (VALID/IVALID)
                                ,("NUM_ROWS", Integer)                                  -- Number of rows in the table
                                ,("BLOCKS", Integer)                                    -- Number of Blocks allocated for this table
                                ,("LAST_ANALYZED", Timestamp "MMDDYYYY HH24:MI:SS")   -- Timestamp of the last time the table was analyzed (i.e., gathered statistics) 
                            ]
                        )
                        ["OWNER", "TABLE_NAME"] -- primary key
                        [] -- (alternative) unique keys    

-- Result RTable metadata
result_tab_MData :: RTableMData
result_tab_MData = ...

-- 2. Define your ETL code
-- E.g.,
myEtl :: [RTable] -> [RTable]
myEtl [rtab] = 
    -- 3. Define your Julius Expression(s)
    let jul = 
            EtlMapStart
            :-> (EtlR $
                    ROpStart  
                    :. (...)
            ...
    -- 4. Evaluate Julius to the Result RTable
    in [juliusToRTable jul]

main :: IO ()
main = do
    
    -- 5. read source csv files
    -- E.g.,
    srcCSV <- readCSV "./app/test-data.csv"

    -- 6. Convert CSV to an RTable and do your ETL
    [resultRTab] <- runETL myETL $ [toRTable src_DBTab_MData srcCSV]

    -- 7. Print your results on screen
    -- E.g.,
    printfRTable (genRTupleFormat ["OWNER", "TABLE_NAME","LAST_ANALYZED"] genDefaultColFormatMap) $ resultRTab

    -- 8. Save your result to a CSV file
    -- E.g.,
    writeCSV "./app/result-data.csv" $ 
                    fromRTable result_tab_MData resultRTab

-- 1.

We define the necessary RTable metadata, for each RTable in our program. This is equivalent to a CREATE TABLE ddl clause in SQL.

-- 2.

Here is where we define our ETL code. We dont want to do our ETL in the main function, so we separate the ETL code into a separate function (myETL). In general, in our main, we want to follow the pattern:

Read your Input (Extract phase)
Do your ETL (Transform phase)
Write your Output (Load phase)

This function receives as input a list with all the necessary Source RTables (in our case we have a single item list) and outputs a list with all the resulting (Target) RTable, after the all the necessary transformation steps have been executed. Of course an ETL code might produce more than one target RTables, e.g., a target schema (in DB parlance) and not just one as in our example. Moreover, the myETL function can be arbitrary complex, depending on the ETL logic that we want to implement in each case. It is essentially the entry point to our ETL implementation

-- 3.: Our ETL code in general will consist of an arbitrary number of Julius expressions. One can define multiple separate Julius expressions, some of which might depend on others, in order to implement the corresponding ETL logic. Keep in mind that each Julius expression encapsulates the "transformtioin logic" for producing a single target RTable. This holds, even if the target RTable is an intermediate result in the overall ETL process and not a final result RTable.

The evaluation of each individual Julius expression must be in conformance with the input-RTable prerequisites of each Julius expression. So, first we must evaluate all the Julius expressions that dont depend on other Julius expressions but only on source RTables. Then, we evaluate the Julius expressions that depend on the previous ones and so on.

-- 4.: In our case our ETL code consists of a single source RTable that produces a single target RTable. The Julius expression is evaluated into an RTable and returned to the caller of the ETL code (in our case this is main)
-- 5.: Here is where we read our input for the ETL. In our case, this is a simple CSV file that we read with the help of the readCSV function.
-- 6.: We convert our input CSV to an RTable, with the toRTable and pass it as input to our ETL code. We execute our ETL code with the runETL function.
-- 7.: We print our target RTable on screen using the printfRTable function for formatted printed (printf like) of RTables.
-- 8.: We save our target RTable to a CSV file with the fromRTable function.

Simple Julius Expression Examples

Note: Julius Expression are read from top to bottom and from left to right.

Selection (i.e., Filter)

SQL

SELECT * 
FROM expenses exp 
WHERE   exp.category = 'FOOD:SUPER_MARKET' 
        AND exp.amount > 50.00

Julius

juliusToRTable $           
    EtlMapStart
    :-> (EtlR $
            ROpStart  
            :. (Filter (From $ Tab expenses) $ FilterBy myFpred))

myFpred :: RPredicate
myFpred = \t ->    t <!> "category" == "FOOD:SUPER_MARKET" 
                    && 
                    t <!> "amount" > 50.00

Projection

SQL

SELECT "TxTimeStamp", "Category", "Description", "Amount" 
FROM expenses exp 
WHERE   exp.category = 'FOOD:SUPER_MARKET' 
        AND exp.amount > 50.00

Julius

juliusToRTable $           
    EtlMapStart
    :-> (EtlR $
            ROpStart  
            :. (Filter (From $ Tab expenses) $ FilterBy myFpred)
            :. (Select ["TxTimeStamp", "Category","Description", "Amount"] $ From Previous))

myFpred :: RPredicate
myFpred = \t -> t <!> "category" == "FOOD:SUPER_MARKET" 
                && 
                t <!> "amount" > 50.00

Sorting (Order By)

SQL

SELECT "TxTimeStamp", "Category", "Description", "Amount" 
FROM expenses exp 
WHERE   exp.category = 'FOOD:SUPER_MARKET' 
        AND exp.amount > 50.00
ORDER BY "TxTimeStamp" DESC

Julius

juliusToRTable $           
    EtlMapStart
    :-> (EtlR $
            ROpStart  
            :. (Filter (From $ Tab expenses) $ FilterBy myFpred)
            :. (Select ["TxTimeStamp", "Category","Description", "Amount"] $ From Previous)
            :. (OrderBy [("TxTimeStamp", Desc)] $ From Previous))

myFpred :: RPredicate
myFpred = \t -> t <!> "category" == "FOOD:SUPER_MARKET" 
                && 
                t <!> "amount" > 50.00

Grouping and Aggregation

SQL

SELECT "Category",  sum("Amount") AS "TotalAmount"
FROM expenses exp 
GROUP BY  "Category"
ORDER BY "TotalAmount" DESC

Julius

juliusToRTable $           
    EtlMapStart
    :-> (EtlR $
            ROpStart  
            :. (GroupBy ["Category"]  
                        (AggOn [Sum "Amount" $ As "TotalAmount"] (From $ Tab expenses)) $ 
                        GroupOn  (t1 t2 -> t1 <!> "Category" == t2 <!> "Category")                                                                         
                )
            :. (OrderBy [ ("TotalAmount", Desc)] $ From Previous))

Group By and then Right Outer Join

First group the expenses table by category of expenses and then do a right outer join with the budget table, in order to pair the expenses at the category level with the correpsonding budget amount (if there is one). Preserve the rows from the expenses table.

SQL

WITH exp
as (
    SELECT "Category",  sum("Amount") AS "TotalAmount"
    FROM expenses
    GROUP BY  "Category"
)
SELECT  exp."Category", exp."TotalAmount", bdg."YearlyBudget"
FROM budget bdg RIGHT JOIN exp ON (bdg."Category" = exp."Category")
ORDER BY exp."TotalAmount" DESC

Julius

juliusToRTable $           
    EtlMapStart
    :-> (EtlR $
            ROpStart  
            :. (GroupBy ["Category"]  
                        (AggOn [Sum "Amount" $ As "TotalAmount"] (From $ Tab expenses)) $ 
                        GroupOn  (t1 t2 -> t1 <!> "Category" == t2 <!> "Category")                                                                         
                )
                -- >>> A Right Outer Join that preserves the Previous result RTuples and joins with the budget table
            :. (RJoin (TabL budget) Previous $ 
                        JoinOn (tl tr ->                                             
                                    tl <!> "Category" == tr <!> "Category")
                                )
                )
            :. (OrderBy [ ("TotalAmount", Desc)] $ From Previous))

A Column Mapping

We will use the previous result (i.e., a table with expenses and budget amounts per category of expenses), in order to create a derived column to calculate the residual amount, with the help of a Column Mapping Expression (ColMappingExpr).

SQL

WITH exp
as (
    SELECT "Category",  sum("Amount") AS "TotalAmount"
    FROM expenses
    GROUP BY  "Category"
)
SELECT  exp."Category", exp."TotalAmount", bdg."YearlyBudget", bdg."YearlyBudget" - exp."TotalAmount" AS "ResidualAmount"
FROM budget bdg RIGHT JOIN exp ON (bdg."Category" = exp."Category")
ORDER BY exp."TotalAmount" DESC

Julius

juliusToRTable $           
    EtlMapStart
    :-> (EtlR $
            ROpStart  
            :. (GroupBy ["Category"]  
                        (AggOn [Sum "Amount" $ As "TotalAmount"] (From $ Tab expenses)) $ 
                        GroupOn  (t1 t2 -> t1 <!> "Category" == t2 <!> "Category")                                                                         
                )
            :. (RJoin (TabL budget) Previous $ 
                        JoinOn (tl tr ->                                             
                                    tl <!> "Category" == tr <!> "Category")
                                )
                )
            :. (Select ["Category",  "TotalAmount", "YearlyBudget"] $ From Previous)
            :. (OrderBy [ ("TotalAmount", Desc)] $ From Previous)
        )
        -- >>> A Column Mapping to create a derived column ("ResidualAmount"")
    :-> (EtlC $
            Source ["TotalAmount", "YearlyBudget"] $
            Target ["ResidualAmount"] $
            By ([totAmount, yearlyBudget] -> [yearlyBudget - totAmount]) 
                (On Previous) DontRemoveSrc $ FilterBy (t -> True)
        )

Naming Intermediate Results in a Julius Expression

In the following example, each named result is an autonomous (intermediate) result, that can be accessed directly and we can handle it as a distinct result (i.e.,RTable). So we can refer to such a result in a subsequent expression, or we can print this result separately, with the printRTable function etc. Each such named result resembles a separate subquery in an SQL WITH clause.

SQL

WITH detailYearTXTab
as (
    SELECT "TxTimeStamp", "Category", "Description", "Amount","DebitCredit"
    FROM txTab
    WHERE   to_number(to_char("TxTimeStamp", YYYY)) >= yearInput 
            AND 
            to_number(to_char("TxTimeStamp", YYYY)) < yearInput + 1
),
expGroupbyCategory
as (
    SELECT "Category", sum ("Amount") AS "AmountSpent"
    FROM detailYearTXTab
    WHERE
        "DebitCredit" = "D"
    GROUP BY "Category"
    ORDER BY 2 DESC
),
revGroupbyCategory
as (
    SELECT "Category", sum("Amount") AS "AmountReceived"
    FROM detailYearTXTab
    WHERE
        "DebitCredit" = "C"
    GROUP BY "Category"
    ORDER BY 2 DESC
),
ojoinedWithBudget
as(
    SELECT "Category", "AmountSpent", YearlyBudget"
    FROM budget bdg RIGHT JOIN expGroupbyCategory exp ON (bdg."Category" = exp."Category")
),
calculatedFields
as(
    SELECT "Category", "AmountSpent", "YearlyBudget", "YearlyBudget" - "AmountSpent" AS "ResidualAmount" 
    FROM ojoinedWithBudget
)
SELECT  *
FROM calculatedFields

Julius

let
    julExpr = 
        -- 1. get detailed transactions of the year
        :=> NamedResult "detailYearTXtab" (EtlR $
                ROpStart
                -- keep RTuples only of the specified year
                :. (Filter (From $ Tab txTab) $    
                        FilterBy (t ->       rtime (t <!> "TxTimeStamp") >= 
                                                                        RTimestampVal {year = yearInput, month = 1, day = 1, hours24 = 0, minutes = 0, seconds = 0}
                                                                    &&
                                                                        rtime (t <!> "TxTimeStamp") < 
                                                                        RTimestampVal { year = yearInput + 1, 
                                                                                        month = 1,
                                                                                        day = 1, hours24 = 0, minutes = 0, seconds = 0})
                    )
                -- keep only columns of interest
                :. (Select ["TxTimeStamp", "Category", "Description", "Amount","DebitCredit"] $ From Previous)
                :. (OrderBy [("TxTimeStamp", Asc)] $ From Previous)
            )
        -- 2. expenses group by category
        :=> NamedResult "expGroupbyCategory" (EtlR $
                ROpStart
                    -- keep only the "debit" transactions
                :. (FilterBy    (From Previous) $
                                FilterBy (t -> t <!> "DebitCredit" == "D")
                    )                
                :. (GroupBy ["Category"] 
                            (AggOn [Sum "Amount" $ As "AmountSpent" ]  $ From Previous) $ 
                            GroupOn (t1 t2 ->  t1 <!> "Category" == t2 <!> "Category")
                    )
                :. (OrderBy [("AmountSpent", Desc)] $ From Previous)
            )
        -- 3. revenues group by category
        :=> NamedResult "revGroupbyCategory" (EtlR $
                ROpStart
                    -- keep only the "credit" transactions
                :. (FilterBy    (From $ juliusToRTable $ takeNamedResult "detailYearTXtab" julExpr) $
                                FilterBy (t -> t <!> "DebitCredit" == "C")
                    )                
                :. (GroupBy ["Category"] 
                            (AggOn [Sum "Amount" $ As "AmountReceived" ]  $ From Previous) $ 
                            GroupOn (t1 t2 ->  t1 <!> "Category" == t2 <!> "Category")
                    )
                :. (OrderBy [("AmountReceived", Desc)] $ From Previous)
            )
        -- 3. Expenses Group By Category Outer joined with budget info
        :=> NamedResult "ojoinedWithBudget" (EtlR $
                ROpStart
                :. (RJoin (TabL budget) (Tab $ juliusToRTable $ takeNamedResult "expGroupbyCategory" julExpr) $ 
                            JoinOn (tl tr ->                                             
                                        tl <!> "Category" == tr <!> "Category")
                                    )
                    )
                :. (Select ["Category",  "AmountSpent", "YearlyBudget"] $ From Previous)
                :. (OrderBy [ ("TotalAmount", Desc)] $ From Previous)
            )
        -- 4. A Column Mapping to create a derived column ("ResidualAmount")
        :=> NamedResult "calculatedFields"  (EtlC $
                Source ["AmountSpent", "YearlyBudget"] $
                Target ["ResidualAmount"] $
                By ([amountSpent, yearlyBudget] -> [yearlyBudget - amountSpent]) 
                    (On Previous) DontRemoveSrc $ FilterBy (t -> True)
            )

-- 5. Print detail transactions
printRTable $ juliusToRTable $ takeNamedResult "detailYearTXtab" julExpr

-- 6. Print Expenses by Category
printRTable $ juliusToRTable $ takeNamedResult "expGroupbyCategory" julExpr

-- 7. Print Expenses with Budgeting Info and Residual Amount
printRTable $ juliusToRTable $ takeNamedResult "calculatedFields" julExpr

-- equivalently
printRTable $ juliusToRTable julExpr

-- 8. Print Revenues by Category
printRTable $ juliusToRTable $ takeNamedResult "revGroupbyCategory" julExpr

Explanation of each named result in the above example:

"detailYearTXtab": We retrieve the detail transactions of the current year only and project the columns of interest. We order result by transaction timestamp.
"expGroupbyCategory": We filter the previous result ("detailYearTXtab"), in order to get only the transactions corresponding to expenses (t <!> "DebitCredit" == "D") and then we group by category, in order to get a total amount spent by category of expenses.
"revGroupbyCategory": We filter the "detailYearTXtab" result (see the use of the takeNamedResult function in order to access the "detailYearTXtab" intermediate result), in order to get only the transactions corresponding to revenues (t <!> "DebitCredit" == "C") and then we group by category, in order to get a total amount received by category of revenues.
"ojoinedWithBudget": We do a right join of the grouped expenses with the budget table. Again note the use of the takeNamedResult function. The preserved RTable is the grouped expenses ("expGroupbyCategory").
"calculatedFields": Finally, we use a Column Mapping in order to create a derived column, which holds the residual amount for each category of expenses.

Synopsis

module RTable.Core
data ETLMappingExpr
- = EtlMapStart
- | ETLMappingExpr :-> ETLOpExpr
- | ETLMappingExpr :=> NamedMap
data ETLOpExpr
- = EtlC ColMappingExpr
- | EtlR ROpExpr
data NamedMap = NamedResult NamedResultName ETLOpExpr
type NamedResultName = String
data ColMappingExpr
- = Source [ColumnName] ToColumn
- | ColMappingEmpty
data ToColumn = Target [ColumnName] ByFunction
data ByFunction = By ColXForm OnRTable RemoveSrcCol ByPred
data OnRTable = On TabExpr
data TabExpr
- = Tab RTable
- | Previous
data RemoveSrcCol
- = RemoveSrc
- | DontRemoveSrc
data ByPred = FilterBy RPredicate
data ByDelPred = Where RPredicate
data SetColumns = Set [(ColumnName, RDataType)]
data ROpExpr
- = ROpStart
- | ROpExpr :. RelationalOp
data RelationalOp
- = Filter FromRTable ByPred
- | Select [ColumnName] FromRTable
- | Agg Aggregate
- | GroupBy [ColumnName] Aggregate GroupOnPred
- | Join TabLiteral TabExpr TabExprJoin
- | LJoin TabLiteral TabExpr TabExprJoin
- | RJoin TabLiteral TabExpr TabExprJoin
- | FOJoin TabLiteral TabExpr TabExprJoin
- | SemiJoin TabLiteral TabExpr TabExprJoin
- | SemiJoinP TabExpr TabLiteral TabExprJoin
- | AntiJoin TabLiteral TabExpr TabExprJoin
- | AntiJoinP TabExpr TabLiteral TabExprJoin
- | TabLiteral `Intersect` TabExpr
- | TabLiteral `Union` TabExpr
- | TabLiteral `UnionAll` TabExpr
- | TabLiteral `Minus` TabExpr
- | TabExpr `MinusP` TabLiteral
- | GenUnaryOp OnRTable ByGenUnaryOperation
- | GenBinaryOp TabLiteral TabExpr ByGenBinaryOperation
- | OrderBy [(ColumnName, OrderingSpec)] FromRTable
- | Delete FromRTable ByDelPred
- | Update TabExpr SetColumns ByPred
- | Insert IntoClause
- | Upsert MergeInto
data FromRTable = From TabExpr
data Aggregate = AggOn [AggOp] FromRTable
data AggOp
- = Sum ColumnName AsColumn
- | Count ColumnName AsColumn
- | CountDist ColumnName AsColumn
- | CountStar AsColumn
- | Min ColumnName AsColumn
- | Max ColumnName AsColumn
- | Avg ColumnName AsColumn
- | StrAgg ColumnName AsColumn Delimiter
- | GenAgg ColumnName AsColumn AggBy
data AsColumn = As ColumnName
data AggBy = AggBy AggFunction
data GroupOnPred = GroupOn RGroupPredicate
data TabLiteral = TabL RTable
data TabExprJoin = JoinOn RJoinPredicate
data ByGenUnaryOperation = ByUnaryOp UnaryRTableOperation
data ByGenBinaryOperation = ByBinaryOp BinaryRTableOperation
data IntoClause = Into TabExpr InsertSource
data InsertSource
- = Values ValuesClause
- | RTuples TabSource
type ValuesClause = [(ColumnName, RDataType)]
data TabSource = TabSrc RTable
data MergeInto = MergeInto TabExpr MergeSource
data MergeSource = Using TabSource MergeMatchCondition
data MergeMatchCondition = MergeOn RUpsertPredicate WhenMatched
data WhenMatched = WhenMatchedThen UpdateColumns
data UpdateColumns = UpdateCols [ColumnName] ByPred
evalJulius :: ETLMappingExpr -> ETLMapping
juliusToRTable :: ETLMappingExpr -> RTable
runJulius :: ETLMappingExpr -> IO RTable
eitherRunJulius :: Exception e => ETLMappingExpr -> IO (Either e RTable)
juliusToResult :: ETLMappingExpr -> RTabResult
runJuliusToResult :: ETLMappingExpr -> IO RTabResult
eitherRunJuliusToResult :: Exception e => ETLMappingExpr -> IO (Either e RTabResult)
runETL :: ([RTable] -> [RTable]) -> [RTable] -> IO [RTable]
eitherRunETL :: Exception e => ([RTable] -> [RTable]) -> [RTable] -> IO (Either e [RTable])
takeNamedResult :: NamedResultName -> ETLMappingExpr -> ETLMappingExpr
addSurrogateKeyJ :: Integral a => ColumnName -> a -> RTable -> RTable
appendRTableJ :: RTable -> RTable -> RTable
addSurrogateKey :: Integral a => ColumnName -> a -> ETLOperation
appendRTable :: ETLOperation

Documentation

module RTable.Core

The Julius EDSL Syntax

The Julius Expression

data ETLMappingExpr Source #

An ETL Mapping Expression is a "Julius Expression". It is a sequence of individual ETL Operation Expressions. Each such ETL Operation "acts" on some input RTable and produces a new "transformed" output RTable. The ETL Mapping connector :-> (as well as the :=> connector) is left associative because in a ETLMappingExpr operations are evaluated from left to right (or top to bottom) A Named ETL Operation Expression (NamedMap) is just an ETL Operation with a name, so as to be able to reference this specific step in the chain of ETL Operations. It is actually a named intermediate result, which we can reference and use in other parts of our Julius expression

Constructors

EtlMapStart
ETLMappingExpr :-> ETLOpExpr infixl 5
ETLMappingExpr :=> NamedMap infixl 5

data ETLOpExpr Source #

An ETL Operation Expression is either a Column Mapping Expression (ColMappingExpr), or a Relational Operation Expression (ROpExpr)

Constructors

EtlC ColMappingExpr
EtlR ROpExpr

data NamedMap Source #

A named intermediate result in a Julius expression (ETLMappingExpr), which we can access via the takeNamedResult function.

Constructors

NamedResult NamedResultName ETLOpExpr

type NamedResultName = String Source #

The name of an intermediate result, used as a key for accessing this result via the takeNamedResult function.

The Column Mapping Clause

data ColMappingExpr Source #

A Column Mapping (RColMapping) is the main ETL/ELT construct for defining a column-level transformation. Essentially with a Column Mapping we can create one or more new (derived) column(s) (Target Columns), based on an arbitrary transformation function (ColXForm) with input parameters any of the existing columns (Source Columns). So a ColMappingExpr is either empty, or it defines the source columns, the target columns and the transformation from source to target. Notes: * If a target-column has the same name with a source-column and a DontRemoveSrc, or a RemoveSrc has been specified, then the (target-column, target-value) key-value pair, overwrites the corresponding (source-column, source-value) key-value pair * The returned RTable will include only the RTuples that satisfy the filter predicate specified in the FilterBy clause.

Constructors

Source [ColumnName] ToColumn
ColMappingEmpty

data ToColumn Source #

Defines the Target Columns of a Column Mapping Expression (ColMappingExpr) and the column transformation function (ColXForm).

Constructors

Target [ColumnName] ByFunction

data ByFunction Source #

Defines the column transformation function of a Column Mapping Expression (ColMappingExpr), the input RTable that this transformation will take place, an indicator (RemoveSrcCol) of whether the Source Columns will be removed or not in the new RTable that will be created after the Column Mapping is executed and finally, an RTuple filter predicate (ByPred) that defines the subset of RTuples that this Column Mapping will be applied to. If it must be applied to all RTuples, then for the last parameter (ByPred), we can just provide the following RPredicate:

FilterBy (\_ -> True)

Constructors

By ColXForm OnRTable RemoveSrcCol ByPred

data OnRTable Source #

Defines the RTable that the current operation will be applied to.

Constructors

On TabExpr

data TabExpr Source #

A Table Expression defines the RTable on which the current ETL Operation will be applied. If the Previous constructor is used, then this RTable is the result of the previous ETL Operations in the current Julius Expression (ETLMappingExpr)

Constructors

Tab RTable
Previous

data RemoveSrcCol Source #

Indicator of whether the source column(s) in a Column Mapping will be removed or not (used in ColMappingExpr) If a target-column has the same name with a source-column and a DontRemoveSrc has been specified, then the (target-column, target-value) key-value pair, overwrites the corresponding (source-column, source-value) key-value pair.

Constructors

RemoveSrc
DontRemoveSrc

data ByPred Source #

An RTuple predicate clause.

Constructors

FilterBy RPredicate

data ByDelPred Source #

Predicate for Deletion Operation

Constructors

Where RPredicate

data SetColumns Source #

The Set sub-clause of an Update RTable clause. It specifies each column to be updated along with the new value.

Constructors

Set [(ColumnName, RDataType)]

The Relational Operation Clause

data ROpExpr Source #

A Relational Operation Expression (ROpExpr) is a sequence of one or more Relational Algebra Operations applied on a input RTable. It is a sub-expression within a Julius Expression (ETLMappingExpr) and we use it whenever we want to apply relational algebra operations on an RTable (which might be the result of previous operations in a Julius Expression). A Julius Expression (ETLMappingExpr) can contain an arbitrary number of ROpExprs. The relational operation connector :. is left associative because in a ROpExpr operations are evaluated from left to right (or top to bottom).

Constructors

ROpStart
ROpExpr :. RelationalOp infixl 6

data RelationalOp Source #

The Relational Operation (RelationalOp) is a Julius clause that represents a Relational Algebra Operation.

Constructors

Filter FromRTable ByPred	`RTuple` filtering clause (selection operation), based on an arbitrary predicate function (`RPredicate`)
Select [ColumnName] FromRTable	Column projection clause
Agg Aggregate	Aggregate Operation clause
GroupBy [ColumnName] Aggregate GroupOnPred	Group By clause, based on an arbitrary Grouping predicate function (`RGroupPredicate`)
Join TabLiteral TabExpr TabExprJoin	Inner Join clause, based on an arbitrary join predicate function - not just equi-join - (`RJoinPredicate`)
LJoin TabLiteral TabExpr TabExprJoin	Left Join clause, based on an arbitrary join predicate function - not just equi-join - (`RJoinPredicate`)
RJoin TabLiteral TabExpr TabExprJoin	Right Join clause, based on an arbitrary join predicate function - not just equi-join - (`RJoinPredicate`)
FOJoin TabLiteral TabExpr TabExprJoin	Full Outer Join clause, based on an arbitrary join predicate function - not just equi-join - (`RJoinPredicate`)
SemiJoin TabLiteral TabExpr TabExprJoin	Implements the semi-Join operation between two RTables (any type of join predicate is allowed) It returns the `RTuple`s from the left `RTable` that match with the right `RTable`. Note that if an `RTuple` from the left `RTable` matches more than one `RTuple`s from the right `RTable` the semi join operation will return only a single `RTuple`.
SemiJoinP TabExpr TabLiteral TabExprJoin
AntiJoin TabLiteral TabExpr TabExprJoin	Implements the anti-Join operation between two RTables (any type of join predicate is allowed) It returns the `RTuple`s from the left `RTable` that DONT match with the right `RTable`.
AntiJoinP TabExpr TabLiteral TabExprJoin
TabLiteral `Intersect` TabExpr	Intersection clause
TabLiteral `Union` TabExpr	Union clause. Note this operation eliminates dublicate `RTuples`
TabLiteral `UnionAll` TabExpr	Union All clause. It is a Union operation without dublicate `RTuple` elimination.
TabLiteral `Minus` TabExpr	Minus clause (set Difference operation)
TabExpr `MinusP` TabLiteral
GenUnaryOp OnRTable ByGenUnaryOperation	This is a generic unary operation on a RTable (`UnaryRTableOperation`). It is used to define an arbitrary unary operation on an `RTable`
GenBinaryOp TabLiteral TabExpr ByGenBinaryOperation	This is a generic binary operation on a RTable (`BinaryRTableOperation`). It is used to define an arbitrary binary operation on an `RTable`
OrderBy [(ColumnName, OrderingSpec)] FromRTable	Order By clause.
Delete FromRTable ByDelPred	Delete operation. Deletes the `RTuple`s from an `RTable` based on an `RPredicate`. Please note that this is an immutable implementation of an `RTable` update. This simply means that the delete operation returns a new `RTable`. So, the original `RTable` remains unchanged and no deletion in-place takes place whatsoever. Moreover, if we have multiple threads deleting an `RTable`, due to immutability, each thread "sees" its own copy of the `RTable` and thus there is no need for locking the deleted `RTuple`s, as happens in a common RDBMS.
Update TabExpr SetColumns ByPred	Update an `RTable`. Please note that this is an immutable implementation of an `RTable` update. This simply means that the update operation returns a new `RTable` that includes all the `RTuple`s of the original `RTable`, both the ones that have been updated and the others that have not. So, the original `RTable` remains unchanged and no update in-place takes place whatsoever. Moreover, if we have multiple threads updating an `RTable`, due to immutability, each thread "sees" its own copy of the `RTable` and thus there is no need for locking the updated `RTuple`s, as happens in a common RDBMS.
Insert IntoClause	Insert Operation. It can insert into an `RTable` a single `RTuple` or a whole `RTable`. The latter is the equivalent of an `INSERT INTO SELECT` clause in SQL. Since, an `RTable` can be the result of a Julius expression (playing the role of a subquery within the Insert clause, in this case). Please note that this is an immutable implementation of an `RTable` insert. This simply means that the insert operation returns a new `RTable` and does not affect the original `RTable`. Also note that the source and target `RTable`s should have the same structure. By "structure", we mean that the `ColumnName`s and the corresponding data types must match. Essentially what we record in the `ColumnInfo` must be the same for the two `RTable`s. Otherwise a `ConflictingRTableStructures` exception will be thrown.
Upsert MergeInto	Upsert (Update+Insert, aka Merge) Operation. We provide a source `RTable` and a matching condition (`RUpsertPredicate`) to the `RTuple`s of the target `RTable`. An `RTuple` from the target `RTable` might match to a single only `RTuple` in the source `RTable`, or not match at all. If it is matched to more than one `RTuple`s then an exception (`UniquenessViolationInUpsert`)is thrown. When an `RTuple` from the target `RTable` is matched to a source `RTuple`, then the corresponding columns of the target `RTuple` are updated with the new values provided in the source `RTuple`. This takes place for the target `RTuple`s that match but also that satisfy the input `RPredicate`. Thus we can restrict further with a filter the `RTuple`s of the target `RTable` where the update will take place. Finally, the source `RTuple`s that did not match to the target `RTable`, are inserted (appended) to the target `RTable` Please note that this is an immutable implementation of an `RTable` upsert. This simply means that the upsert operation returns a new `RTable` and does not affect the original `RTable`. Also note that the source and target `RTable`s should have the same structure. By "structure", we mean that the `ColumnName`s and the corresponding data types must match. Essentially what we record in the `ColumnInfo` must be the same for the two `RTable`s. Otherwise a `ConflictingRTableStructures` exception will be thrown. An Example: Source RTable: srcTab = Id \| Msg \| Other ----\|---------------\|------- 1 \| "updated" \|"a" 2 \| "world2" \|"a" 3 \| "inserted" \|"a" Target RTable: trgTab = Id \| Msg \| Other ----\|---------------\|------- 1 \| "hello1" \|"b" 2 \| "world1" \|"b" 4 \| "old" \|"b" 5 \| "hello" \|"b" juliusToRTable $ EtlMapStart :-> (EtlR $ ROpStart :.(Upsert $ MergeInto (Tab trgTab) $ Using (TabSrc srcTab) $ MergeOn (RUpsertPredicate ["Id"] (\t1 t2 -> t1 <!> "Id" == t2 <!> "Id")) $ -- merge condition: srcTab.Id == trgTab.Id WhenMatchedThen $ UpdateCols ["Msg"] $ FilterBy (\t -> let msg = case toText (t <!> "Msg") of Just t -> t Nothing -> pack "" in (take 5 msg) == (pack "hello") ) -- Msg like "hello%" ) ) Result RTable: Id \| Msg \| Other ----\|---------------\|------- 1 \| "updated" \|"b" -- Updated RTuple. Note that only column "Msg" has been overwritten, as per the UpdateCols subclause 2 \| "world1" \|"b" -- Not affected due to FilterBy predicate 3 \| "inserted" \|"a" -- Inserted RTuple 4 \| "old" \|"b" -- Not affected due to MergeOn condition 5 \| "hello" \|"b" -- Not affected due to MergeOn condition

data FromRTable Source #

Resembles the "FROM" clause in SQL. It defines the RTable on which the Relational Operation will be applied

Constructors

From TabExpr

data Aggregate Source #

An Aggregate Operation Clause

Constructors

AggOn [AggOp] FromRTable

data AggOp Source #

These are the available aggregate operation clauses

Constructors

Sum ColumnName AsColumn
Count ColumnName AsColumn	Count aggregation (no distinct)
CountDist ColumnName AsColumn	Count distinct aggregation (i.e., `count(distinct col)` in SQL). Returns the distinct number of values for this column.
CountStar AsColumn	Returns the number of `RTuple`s in the `RTable` (i.e., `count(*)` in SQL)
Min ColumnName AsColumn
Max ColumnName AsColumn
Avg ColumnName AsColumn	Average aggregation
StrAgg ColumnName AsColumn Delimiter	String aggregation
GenAgg ColumnName AsColumn AggBy	A custom aggregate operation

data AsColumn Source #

Defines the name of the column that will hold the aggregate operation result. It resembles the "AS" clause in SQL.

Constructors

As ColumnName

data AggBy Source #

Julius Clause to provide a custom aggregation function

Constructors

AggBy AggFunction

data GroupOnPred Source #

A grouping predicate clause. It defines an arbitrary function (RGroupPRedicate), which drives when two RTuples should belong in the same group.

Constructors

GroupOn RGroupPredicate

data TabLiteral Source #

This clause is used for expressions where we do not allow the use of the Previous value

Constructors

TabL RTable

data TabExprJoin Source #

Join Predicate Clause. It defines when two RTuples should be paired.

Constructors

JoinOn RJoinPredicate

data ByGenUnaryOperation Source #

It is used to define an arbitrary unary operation on an RTable

Constructors

ByUnaryOp UnaryRTableOperation

data ByGenBinaryOperation Source #

It is used to define an arbitrary binary operation on an RTable

Constructors

ByBinaryOp BinaryRTableOperation

data IntoClause Source #

Insert Into subclause

Constructors

Into TabExpr InsertSource

data InsertSource Source #

Subclause on Insert clause. Defines the source of the insert operation. The Values branch is used for inserting a singl RTuple, while the RTuples branch is used for inserting a whole RTable, typically derived as the result of a Julius expression. The former is similar in concept with an INSERT INTO VALUES SQL clause, and the latter is similar in concept with an INSERT INTO SELECT SQL clause.

Constructors

Values ValuesClause
RTuples TabSource

type ValuesClause = [(ColumnName, RDataType)] Source #

Subclause on Insert clause. Defines the source RTuple of the insert operation.

data TabSource Source #

This subclause refers to the source RTable that will feed an Insert operation

Constructors

TabSrc RTable

data MergeInto Source #

Merge Into subclause

Constructors

MergeInto TabExpr MergeSource

data MergeSource Source #

Upsert source subclause (Using clause in SQL)

Constructors

Using TabSource MergeMatchCondition

data MergeMatchCondition Source #

Upsert matching condition subclause

Constructors

MergeOn RUpsertPredicate WhenMatched

data WhenMatched Source #

When Matched subclause of Upsert

Constructors

WhenMatchedThen UpdateColumns

data UpdateColumns Source #

Update columns subclause of Upsert

Constructors

UpdateCols [ColumnName] ByPred

Julius Expression Evaluation

evalJulius :: ETLMappingExpr -> ETLMapping Source #

Evaluates (parses) the Julius exrpession and produces an ETLMapping. The ETLMapping is an internal representation of the Julius expression and one needs to combine it with the etl function, in order to evaluate the Julius expression into an RTable. This can be achieved directly with function juliusToRTable

juliusToRTable :: ETLMappingExpr -> RTable Source #

Pure code to evaluate the "ETL-logic" of a Julius expression and generate the corresponding target RTable.

The evaluation of a Julius expression (i.e., a ETLMappingExpr) to an RTable is strict. It evaluates fully to Normal Form (NF) as opposed to a lazy evaluation (i.e., only during IO), or evaluation to a WHNF. This is for efficiency reasons (e.g., avoid space leaks and excessive memory usage). It also has the impact that exceptions will be thrown at the same line of code that juliusToRTable is called. Thus one should wrap this call with a catch handler, or use eitherPrintRTable, or eitherPrintfRTable, if one wants to handle the exception gracefully.

Example:

do 
 catch (printRTable $ juliusToRTable $ <a Julius expression> )
       (\e -> putStrLn $ "There was an error in the Julius evaluation: " ++ (show (e::SomeException)) )

Or, similarly

do 
 p <- (eitherPrintRTable  printRTable $
                          juliusToRTable $ <a Julius expression>                                 
      ) :: IO (Either SomeException ())
 case p of
           Left exc -> putStrLn $ "There was an error in the Julius evaluation: " ++ (show exc)
           Right _  -> return ()

runJulius :: ETLMappingExpr -> IO RTable Source #

Evaluate a Julius expression within the IO Monad. I.e., Effectful code to evaluate the "ETL-logic" of a Julius expression and generate the corresponding target RTable.

The evaluation of a Julius expression (i.e., a ETLMappingExpr) to an RTable is strict. It evaluates fully to Normal Form (NF) as opposed to a lazy evaluation (i.e., only during IO), or evaluation to a WHNF. This is for efficiency reasons (e.g., avoid space leaks and excessive memory usage). It also has the impact that exceptions will be thrown at the same line of code that runJulius is called. Thus one should wrap this call with a catch handler, or use eitherRunJulius, if he wants to handle the exception gracefully.

Example:

do 
    result <- catch (runJulius $  <a Julius expression>)
                    (e -> do 
                             putStrLn $ "there was an error in Julius evaluation: " ++ (show (e::SomeException))
                             return emptyRTable
                    )

eitherRunJulius :: Exception e => ETLMappingExpr -> IO (Either e RTable) Source #

Evaluate a Julius expression and return the corresponding target RTable or an exception. One can define custom exceptions to be thrown within a Julius expression. This function will catch any exceptions that are instances of the Exception type class.

The evaluation of a Julius expression (i.e., a ETLMappingExpr) to an RTable is strict. It evaluates fully to Normal Form (NF) as opposed to a lazy evaluation (i.e., only during IO), or evaluation to a WHNF. This is for efficiency reasons (e.g., avoid space leaks and excessive memory usage).

Example:

do 
    res <- (eitherRunJulius $ <a Julius expression>) :: IO (Either SomeException RTable) 
    resultRTab  <- case res of
                    Right t  -> return t
                    Left exc ->  do 
                                    putStrLn $ "there was an error in Julius evaluation: " ++ (show exc)
                                    return emptyRTable

juliusToResult :: ETLMappingExpr -> RTabResult Source #

Receives an input Julius expression, evaluates it to an ETL Mapping (ETLMapping) and executes it, in order to return an RTabResult containing an RTable storing the result of the ETL Mapping, as well as the number of RTuples returned

runJuliusToResult :: ETLMappingExpr -> IO RTabResult Source #

Evaluate a Julius expression within the IO Monad and return an RTabResult.

eitherRunJuliusToResult :: Exception e => ETLMappingExpr -> IO (Either e RTabResult) Source #

Evaluate a Julius expression within the IO Monad and return either an RTabResult, or an exception, in case of an error during evaluation.

runETL :: ([RTable] -> [RTable]) -> [RTable] -> IO [RTable] Source #

Generic ETL execution function. It receives a list of input (aka "source") RTables and an ETL function that produces a list of output (aka "target") RTables. The ETL function should embed all the "transformation-logic" from the source RTables to the target RTables.

eitherRunETL :: Exception e => ([RTable] -> [RTable]) -> [RTable] -> IO (Either e [RTable]) Source #

Generic ETL execution function that returns either the target list of RTables, or an exception in case of a problem during the ETL code execution. It receives a list of input (aka "source") RTables and an ETL function that produces a list of output (aka "target") RTables. The ETL function should embed all the "transformation-logic" from the source RTables to the target RTables.

takeNamedResult Source #

Arguments

:: NamedResultName	the name of the intermediate result
-> ETLMappingExpr	input ETLMapping Expression
-> ETLMappingExpr	output ETLMapping Expression

Returns a prefix of an ETLMappingExpr that matches a named intermediate result. For example, below we show a Julius expression where we define an intermediate named result called "myResult". This result, is used at a later stage in this Julius expression, with the use of the function takeNamedResult.

       etlXpression = 
                   EtlMapStart
                   :-> (EtlC $ ...)                
                   :=> NamedResult "myResult" (EtlR $ ...) 
                   :-> (EtlR $ ... )
                   :-> (EtlR $
                           ROpStart
                           :.  (Minus 
                                   (TabL $ 
                                       juliusToRTable $ takeNamedResult "myResult" etlXpression    --  THIS IS THE POINT WHERE WE USE THE NAMED RESULT!
                                   ) 
                                   (Previous))
                       )

In the above Julius expression (etlXpresion) the "myResult" named result equals to the prefix of the etlXpresion, up to the operation (included) with the named result "myResult".

     
     takeNamedResult "myResult" etlXpression ==  EtlMapStart
                                                 :-> (EtlC $ ...)                
                                                 :=> NamedResult "myResult" (EtlR $ ...)

Note that the julius expression is scanned from right to left and thus it will return the longest prefix expression that matches the input name

Various ETL Operations

addSurrogateKeyJ Source #

Arguments

:: Integral a
=> ColumnName	The name of the surrogate key column
-> a	The initial value of the Surrogate Key will be the value of this parameter
-> RTable	Input RTable
-> RTable	Output RTable

Returns an UnaryRTableOperation (RTable -> RTable) that adds a surrogate key (SK) column to an RTable and fills each row with a SK value. It primarily is intended to be used within a Julius expression. For example:

 GenUnaryOp (On Tab rtab1) $ ByUnaryOp (addSurrogateKeyJ TxSK 0)

appendRTableJ Source #

Arguments

:: RTable	Target RTable
-> RTable	Input RTable
-> RTable	Output RTable

Returns a BinaryRTableOperation (RTable -> RTable -> RTable) that Appends an RTable to a target RTable. It is primarily intended to be used within a Julius expression. For example:

  GenBinaryOp (TabL rtab1) (Tab $ rtab2) $ ByBinaryOp appendRTableJ

addSurrogateKey Source #

Arguments

:: Integral a
=> ColumnName	The name of the surrogate key column -> Integer -- ^ The initial value of the Surrogate Key will be the value of this parameter
-> a	The initial value of the Surrogate Key will be the value of this parameter
-> ETLOperation	Output ETL operation which encapsulates the add surrogate key column mapping

Returns an ETLOperation that adds a surrogate key (SK) column to an RTable and fills each row with a SK value. This function is only exposed for backward compatibility reasons. The recommended function to use instead is addSurrogateKeyJ, which can be embedded directly into a Julius expression as a UnaryRTableOperation.

appendRTable Source #

Arguments

:: ETLOperation

Output ETL Operation

Returns an ETLOperation that Appends an RTable to a target RTable This function is only exposed for backward compatibility reasons. The recommended function to use instead is appendRTableJ, which can be embedded directly into a Julius expression as a BinaryRTableOperation.