DBFunctor-0.1.1.0: DBFunctor - Functional Data Management => ETL/ELT Data Processing in Haskell

Copyright(c) Nikos Karagiannidis 2018
LicenseBSD3
Maintainernkarag@gmail.com
Stabilitystable
PortabilityPOSIX
Safe HaskellNone
LanguageHaskell2010

Etl.Internal.Core

Contents

Description

This is an internal module (i.e., not to be imported directly) that implements the core ETL functionality that is exposed via the Julius EDSL for ETL/ELT found in the Etl.Julius module)

Synopsis

Basic Data Types

data RColMapping Source #

This is the basic data type to define the column-to-column mapping from a source RTable to a target RTable. Essentially, an RColMapping represents the column-level transformations of an RTuple that will yield a target RTuple.

A mapping is simply a triple of the form ( Source-Column(s), Target-Column(s), Transformation, RTuple-Filter), where we define the source columns over which a transformation (i.e. a function) will be applied in order to yield the target columns. Also, an RPredicate (i.e. a filter) might be applied on the source RTuple. Remember that an RTuple is essentially a mapping between a key (the Column Name) and a value (the RDataType value). So the various RColMapping data constructors below simply describe the possible modifications of an RTuple orginating from its own columns.

So, we can have the following mapping types: a) single-source column to single-target column mapping (1 to 1), the source column will be removed or not based on the removeSrcCol flag (dublicate column names are not allowed in an RTuple) b) multiple-source columns to single-target column mapping (N to 1), The N columns will be merged to the single target column based on the transformation. The N columns will be removed from the RTuple or not based on the removeSrcCol flag (dublicate column names are not allowed in an RTuple) c) single-source column to multiple-target columns mapping (1 to M) the source column will be "expanded" to M target columns based ont he transformation. the source column will be removed or not based on the removeSrcCol flag (dublicate column names are not allowed in an RTuple) d) multiple-source column to multiple target columns mapping (N to M) The N source columns will be mapped to M target columns based on the transformation. The N columns will be removed from the RTuple or not based on the removeSrcCol flag (dublicate column names are not allow in an RTuple)

Some examples of mapping are the following:

     (Start_Date, No, StartDate, t -> True)  --  copy the source value to target and dont remove the source column, so the target RTuple will have both columns Start_Date and StartDate
                                          --  with the exactly the same value)

     ([Amount, Discount], Yes, FinalAmount, ([a, d] -> a * d) ) -- FinalAmount is a derived column based on a function applied to the two source columns. 
                                                                       --  In the final RTuple we remove the two source columns.
  

An RColMapping can be applied with the runCM (runColMapping) operator

Constructors

ColMapEmpty 
RMap1x1

single-source column to single-target column mapping (1 to 1).

RMapNx1

multiple-source columns to single-target column mapping (N to 1)

RMap1xN

single-source column to multiple-target columns mapping (1 to N)

RMapNxM

multiple-source column to multiple target columns mapping (N to M)

type ColXForm = [RDataType] -> [RDataType] Source #

A Column Transformation function data type. It is used in order to define an arbitrary column-level transformation (i.e., from a list of N input Column-Values we produce a list of M derived (output) Column-Values). A Column value is represented with the RDataType.

createColMapping Source #

Arguments

:: [ColumnName]

List of source column names

-> [ColumnName]

List of target column names

-> ColXForm

Column Transformation function

-> YesNo

Remove source column option

-> RPredicate

Filtering predicate

-> RColMapping

Output Column Mapping

Constructs an RColMapping. This is the suggested method for creating a column mapping and not by calling the data constructors directly.

data ETLOperation Source #

An ETL operation applied to an RTable can be either an ROperation (a relational agebra operation like join, filter etc.) defined in RTable.Core module, or an RColMapping applied to an RTable

Constructors

ETLrOp 

Fields

ETLcOp 

Fields

data ETLMapping Source #

ETLmapping : it is the equivalent of a mapping in an ETL tool and consists of a series of ETLOperations that are applied, one-by-one, to some initial input RTable, but if binary ETLOperations are included in the ETLMapping, then there will be more than one input RTables that the ETLOperations of the ETLMapping will be applied to. When we apply (i.e., run) an ETLOperation of the ETLMapping we get a new RTable, which is then inputed to the next ETLOperation, until we finally run all ETLOperations. The purpose of the execution of an ETLMapping is to produce a single new RTable as the result of the execution of all the ETLOperations of the ETLMapping. In terms of database operations an ETLMapping is the equivalent of an CREATE AS SELECT (CTAS) operation in an RDBMS. This means that anything that can be done in the SELECT part (i.e., column projection, row filtering, grouping and join operations, etc.) in order to produce a new table, can be included in an ETLMapping.

An ETLMapping is executed with the etl (runETLmapping) operator

Implementation: An ETLMapping is implemented as a binary tree where the node represents the ETLOperation to be executed and the left branch is another ETLMapping, while the right branch is an RTable (that might be empty in the case of a Unary ETLOperation). Execution proceeds from bottom-left to top-right. This is similar in concept to a left-deep join tree. In a Left-Deep ETLOperation tree the "pipe" of ETLOperations comes from the left branches always. The leaf node is always an ETLMapping with an ETLMapEmpty in the left branch and an RTable in the right branch (the initial RTable inputed to the ETLMapping). In this way, the result of the execution of each ETLOperation (which is an RTable) is passed on to the next ETLOperation. Here is an example:

    A Left-Deep ETLOperation Tree

                             final RTable result
                                   / 
                                etlOp3 
                             /        
                          etlOp2     rtab2
                         /       
A leaf-node -->    etlOp1    emptyRTab
                   /       
             ETLMapEmpty   rtab1

You see that always on the left branch we have an ETLMapping data type (i.e., a left-deep ETLOperation tree). So how do we implement the following case?

                    final RTable result
                            / 
 A leaf-node -->         etlOp1 
                         /       
                        rtab1   rtab2

The answer is that we "model" the left RTable (rtab1 in our example) as an ETLMapping of the form:

 ETLMapLD { etlOp = ETLcOp{cmap = ColMapEmpty}, tabL = ETLMapEmpty, tabR = rtab1 }

So we embed the rtab1 in a ETLMapping, which is a leaf (i.e., it has an empty prevMap), the rtab1 is in the right branch (tabR) and the ETLOperation is the EmptyColMapping, which returns its input RTable when executed. We can use function rtabToETLMapping for this job. So it becomes A leaf-node --> etlOp1 / rtabToETLMapping rtab1 rtab2

In this manner, a leaf-node can also be implemented like this:

                             final RTable result
                                   / 
                                etlOp3 
                             /        
                          etlOp2     rtab2
                         /       
A leaf-node -->    etlOp1    emptyRTab
                   /     
  rtabToETLMapping rtab1  emptyRTable

Constructors

ETLMapEmpty

an empty node

ETLMapLD

a Left-Deep node

Fields

  • etlOp :: ETLOperation

    the ETLOperation to be executed

  • tabL :: ETLMapping

    the left-branch corresponding to the previous ETLOperation, which is input to this one.

  • tabR :: RTable

    the right branch corresponds to another RTable (for binary ETL operations). If this is a Unary ETLOperation then this field must be an empty RTable.

ETLMapRD

a Right-Deep node

Fields

  • etlOp :: ETLOperation

    the ETLOperation to be executed

  • tabLrd :: RTable

    the left-branch corresponds to another RTable (for binary ETL operations). If this is a Unary ETLOperation then this field must be an empty RTable.

  • tabRrd :: ETLMapping

    the right branch corresponding to the previous ETLOperation, which is input to this one.

ETLMapBal

a Balanced node

Fields

  • etlOp :: ETLOperation

    the ETLOperation to be executed

  • tabLbal :: ETLMapping

    the left-branch corresponding to the previous ETLOperation, which is input to this one. If this is a Unary ETLOperation then this field might be an empty ETLMapping.

  • tabRbal :: ETLMapping

    the right branch corresponding corresponding to the previous ETLOperation, which is input to this one. -- If this is a Unary ETLOperation then this field might be an empty ETLMapping.

Instances
Eq ETLMapping Source # 
Instance details

Defined in Etl.Internal.Core

data YesNo Source #

Constructors

Yes 
No 
Instances
Eq YesNo Source # 
Instance details

Defined in Etl.Internal.Core

Methods

(==) :: YesNo -> YesNo -> Bool #

(/=) :: YesNo -> YesNo -> Bool #

Show YesNo Source # 
Instance details

Defined in Etl.Internal.Core

Methods

showsPrec :: Int -> YesNo -> ShowS #

show :: YesNo -> String #

showList :: [YesNo] -> ShowS #

Execution of an ETL Mapping

runCM :: RColMapping -> RTable -> RTable Source #

runCM operator executes an RColMapping If a target-column has the same name with a source-column and a DontRemoveSrc (i.e., removeSrcCol == No) has been specified, then the (target-column, target-value) key-value pair, overwrites the corresponding (source-column, source-value) key-value pair

etlOpU :: ETLOperation -> RTable -> RTable Source #

executes a Unary ETL Operation

etlOpB :: ETLOperation -> RTable -> RTable -> RTable Source #

executes a Binary ETL Operation

etl :: ETLMapping -> RTable Source #

This operator executes an ETLMapping

etlRes Source #

Arguments

:: ETLMapping

input ETLMapping

-> RTabResult

output RTabResult

This operator executes an ETLMapping and returns the RTabResult Writer Monad that embedds apart from the resulting RTable, also the number of RTuples returned

Functions for "Building" an ETL Mapping

rtabToETLMapping :: RTable -> ETLMapping Source #

Model an RTable as an ETLMapping which when executed will return the input RTable

createLeafETLMapLD Source #

Arguments

:: ETLOperation

ETL operation of this ETL mapping

-> RTable

input RTable

-> ETLMapping

output ETLMapping

Creates a left-deep leaf ETL Mapping, of the following form:

    A Left-Deep ETLOperation Tree

                             final RTable result
                                   / 
                                etlOp3 
                             /        
                          etlOp2     rtab2
                         /       
A leaf-node -->    etlOp1    emptyRTab
                   /       
             ETLMapEmpty   rtab1

createLeafBinETLMapLD Source #

Arguments

:: ETLOperation

ETL operation of this ETL mapping

-> RTable

input RTable1

-> RTable

input RTable2

-> ETLMapping

output ETLMapping

creates a Binary operation leaf node of the form:

A leaf-node -->    etlOp1    
                   /     
  rtabToETLMapping rtab1  rtab2

connectETLMapLD Source #

Arguments

:: ETLOperation

ETL operation of this ETL Mapping

-> RTable

Right RTable (right branch) (if this is a Unary ETL mapping this should be an emptyRTable)

-> ETLMapping

Previous ETL mapping (left branch)

-> ETLMapping

New ETL Mapping, which has added at the end the new node

Connects an ETL Mapping to a left-deep ETL Mapping tree, of the form

    A Left-Deep ETLOperation Tree

                             final RTable result
                                   / 
                                etlOp3 
                             /        
                          etlOp2     rtab2
                         /       
A leaf-node -->    etlOp1    emptyRTab
                   /       
             ETLMapEmpty   rtab1

Example:

  -- connect a Unary ETL mapping (etlOp2)

                          etlOp2    
                         /       
                      etlOp1    emptyRTab
       
  => connectETLMapLD etlOp2 emptyRTable prevMap

  -- connect a Binary ETL Mapping (etlOp3)

                                etlOp3 
                             /        
                          etlOp2     rtab2

  => connectETLMapLD etlOp3 rtab2 prevMap

Note that the right branch (RTable) appears first in the list of input arguments of this function and the left branch (ETLMapping) appears second. This is strange, and one could thought that it is a mistake (i.e., the left branch should appear first and the right branch second) since we are reading from left to right. However this was a deliberate choice, so that we leave the left branch (which is the connection point with the previous ETLMapping) as the last argument, and thus we can partially apply the argumenets and get a new function with input parameter only the previous mapping. This is very helpfull in function composition

Various ETL Operations