DBFunctor-0.1.1.1: DBFunctor - Functional Data Management => ETL/ELT Data Processing in Haskell

Etl.Internal.Core

Description

This is an internal module (i.e., not to be imported directly) that implements the core ETL functionality that is exposed via the Julius EDSL for ETL/ELT found in the Etl.Julius module)

Synopsis

# Basic Data Types

This is the basic data type to define the column-to-column mapping from a source RTable to a target RTable. Essentially, an RColMapping represents the column-level transformations of an RTuple that will yield a target RTuple.

A mapping is simply a triple of the form ( Source-Column(s), Target-Column(s), Transformation, RTuple-Filter), where we define the source columns over which a transformation (i.e. a function) will be applied in order to yield the target columns. Also, an RPredicate (i.e. a filter) might be applied on the source RTuple. Remember that an RTuple is essentially a mapping between a key (the Column Name) and a value (the RDataType value). So the various RColMapping data constructors below simply describe the possible modifications of an RTuple orginating from its own columns.

So, we can have the following mapping types: a) single-source column to single-target column mapping (1 to 1), the source column will be removed or not based on the removeSrcCol flag (dublicate column names are not allowed in an RTuple) b) multiple-source columns to single-target column mapping (N to 1), The N columns will be merged to the single target column based on the transformation. The N columns will be removed from the RTuple or not based on the removeSrcCol flag (dublicate column names are not allowed in an RTuple) c) single-source column to multiple-target columns mapping (1 to M) the source column will be "expanded" to M target columns based ont he transformation. the source column will be removed or not based on the removeSrcCol flag (dublicate column names are not allowed in an RTuple) d) multiple-source column to multiple target columns mapping (N to M) The N source columns will be mapped to M target columns based on the transformation. The N columns will be removed from the RTuple or not based on the removeSrcCol flag (dublicate column names are not allow in an RTuple)

Some examples of mapping are the following:

     (Start_Date, No, StartDate, t -> True)  --  copy the source value to target and dont remove the source column, so the target RTuple will have both columns Start_Date and StartDate
--  with the exactly the same value)

([Amount, Discount], Yes, FinalAmount, ([a, d] -> a * d) ) -- FinalAmount is a derived column based on a function applied to the two source columns.
--  In the final RTuple we remove the two source columns.


An RColMapping can be applied with the runCM (runColMapping) operator

Constructors

 ColMapEmpty RMap1x1 single-source column to single-target column mapping (1 to 1). Fields RMapNx1 multiple-source columns to single-target column mapping (N to 1) FieldssrcColGrp :: [ColumnName] removeSrcCol :: YesNo trgCol :: ColumnName transformNx1 :: [RDataType] -> RDataType srcRTupleFilter :: RPredicate RMap1xN single-source column to multiple-target columns mapping (1 to N) FieldssrcCol :: ColumnName removeSrcCol :: YesNo trgColGrp :: [ColumnName] transform1xN :: RDataType -> [RDataType] srcRTupleFilter :: RPredicate RMapNxM multiple-source column to multiple target columns mapping (N to M) FieldssrcColGrp :: [ColumnName] removeSrcCol :: YesNo trgColGrp :: [ColumnName] transformNxM :: [RDataType] -> [RDataType] srcRTupleFilter :: RPredicate

type ColXForm = [RDataType] -> [RDataType] Source #

A Column Transformation function data type. It is used in order to define an arbitrary column-level transformation (i.e., from a list of N input Column-Values we produce a list of M derived (output) Column-Values). A Column value is represented with the RDataType.

Arguments

 :: [ColumnName] List of source column names -> [ColumnName] List of target column names -> ColXForm Column Transformation function -> YesNo Remove source column option -> RPredicate Filtering predicate -> RColMapping Output Column Mapping

Constructs an RColMapping. This is the suggested method for creating a column mapping and not by calling the data constructors directly.

An ETL operation applied to an RTable can be either an ROperation (a relational agebra operation like join, filter etc.) defined in RTable.Core module, or an RColMapping applied to an RTable

Constructors

 ETLrOp Fieldsrop :: ROperation ETLcOp Fieldscmap :: RColMapping

ETLmapping : it is the equivalent of a mapping in an ETL tool and consists of a series of ETLOperations that are applied, one-by-one, to some initial input RTable, but if binary ETLOperations are included in the ETLMapping, then there will be more than one input RTables that the ETLOperations of the ETLMapping will be applied to. When we apply (i.e., run) an ETLOperation of the ETLMapping we get a new RTable, which is then inputed to the next ETLOperation, until we finally run all ETLOperations. The purpose of the execution of an ETLMapping is to produce a single new RTable as the result of the execution of all the ETLOperations of the ETLMapping. In terms of database operations an ETLMapping is the equivalent of an CREATE AS SELECT (CTAS) operation in an RDBMS. This means that anything that can be done in the SELECT part (i.e., column projection, row filtering, grouping and join operations, etc.) in order to produce a new table, can be included in an ETLMapping.

An ETLMapping is executed with the etl (runETLmapping) operator

Implementation: An ETLMapping is implemented as a binary tree where the node represents the ETLOperation to be executed and the left branch is another ETLMapping, while the right branch is an RTable (that might be empty in the case of a Unary ETLOperation). Execution proceeds from bottom-left to top-right. This is similar in concept to a left-deep join tree. In a Left-Deep ETLOperation tree the "pipe" of ETLOperations comes from the left branches always. The leaf node is always an ETLMapping with an ETLMapEmpty in the left branch and an RTable in the right branch (the initial RTable inputed to the ETLMapping). In this way, the result of the execution of each ETLOperation (which is an RTable) is passed on to the next ETLOperation. Here is an example:

    A Left-Deep ETLOperation Tree

final RTable result
/
etlOp3
/
etlOp2     rtab2
/
A leaf-node -->    etlOp1    emptyRTab
/
ETLMapEmpty   rtab1



You see that always on the left branch we have an ETLMapping data type (i.e., a left-deep ETLOperation tree). So how do we implement the following case?

                    final RTable result
/
A leaf-node -->         etlOp1
/
rtab1   rtab2



The answer is that we "model" the left RTable (rtab1 in our example) as an ETLMapping of the form:

 ETLMapLD { etlOp = ETLcOp{cmap = ColMapEmpty}, tabL = ETLMapEmpty, tabR = rtab1 }


So we embed the rtab1 in a ETLMapping, which is a leaf (i.e., it has an empty prevMap), the rtab1 is in the right branch (tabR) and the ETLOperation is the EmptyColMapping, which returns its input RTable when executed. We can use function rtabToETLMapping for this job. So it becomes  A leaf-node --> etlOp1 / rtabToETLMapping rtab1 rtab2 

In this manner, a leaf-node can also be implemented like this:

                             final RTable result
/
etlOp3
/
etlOp2     rtab2
/
A leaf-node -->    etlOp1    emptyRTab
/
rtabToETLMapping rtab1  emptyRTable


Constructors

 ETLMapEmpty an empty node ETLMapLD a Left-Deep node FieldsetlOp :: ETLOperationthe ETLOperation to be executed tabL :: ETLMappingthe left-branch corresponding to the previous ETLOperation, which is input to this one.tabR :: RTablethe right branch corresponds to another RTable (for binary ETL operations). If this is a Unary ETLOperation then this field must be an empty RTable. ETLMapRD a Right-Deep node FieldsetlOp :: ETLOperationthe ETLOperation to be executed tabLrd :: RTablethe left-branch corresponds to another RTable (for binary ETL operations). If this is a Unary ETLOperation then this field must be an empty RTable. tabRrd :: ETLMappingthe right branch corresponding to the previous ETLOperation, which is input to this one. ETLMapBal a Balanced node FieldsetlOp :: ETLOperationthe ETLOperation to be executed tabLbal :: ETLMappingthe left-branch corresponding to the previous ETLOperation, which is input to this one. If this is a Unary ETLOperation then this field might be an empty ETLMapping. tabRbal :: ETLMappingthe right branch corresponding corresponding to the previous ETLOperation, which is input to this one. -- If this is a Unary ETLOperation then this field might be an empty ETLMapping.
Instances
 Source # Instance detailsDefined in Etl.Internal.Core Methods

data YesNo Source #

Constructors

 Yes No
Instances
 Source # Instance detailsDefined in Etl.Internal.Core Methods(==) :: YesNo -> YesNo -> Bool #(/=) :: YesNo -> YesNo -> Bool # Source # Instance detailsDefined in Etl.Internal.Core MethodsshowsPrec :: Int -> YesNo -> ShowS #show :: YesNo -> String #showList :: [YesNo] -> ShowS #

# Execution of an ETL Mapping

runCM operator executes an RColMapping If a target-column has the same name with a source-column and a DontRemoveSrc (i.e., removeSrcCol == No) has been specified, then the (target-column, target-value) key-value pair, overwrites the corresponding (source-column, source-value) key-value pair

executes a Unary ETL Operation

executes a Binary ETL Operation

This operator executes an ETLMapping

Arguments

 :: ETLMapping input ETLMapping -> RTabResult output RTabResult

This operator executes an ETLMapping and returns the RTabResult Writer Monad that embedds apart from the resulting RTable, also the number of RTuples returned

# Functions for "Building" an ETL Mapping

Model an RTable as an ETLMapping which when executed will return the input RTable

Arguments

 :: ETLOperation ETL operation of this ETL mapping -> RTable input RTable -> ETLMapping output ETLMapping

Creates a left-deep leaf ETL Mapping, of the following form:

    A Left-Deep ETLOperation Tree

final RTable result
/
etlOp3
/
etlOp2     rtab2
/
A leaf-node -->    etlOp1    emptyRTab
/
ETLMapEmpty   rtab1



Arguments

 :: ETLOperation ETL operation of this ETL mapping -> RTable input RTable1 -> RTable input RTable2 -> ETLMapping output ETLMapping

creates a Binary operation leaf node of the form:

A leaf-node -->    etlOp1
/
rtabToETLMapping rtab1  rtab2


Arguments

 :: ETLOperation ETL operation of this ETL Mapping -> RTable Right RTable (right branch) (if this is a Unary ETL mapping this should be an emptyRTable) -> ETLMapping Previous ETL mapping (left branch) -> ETLMapping New ETL Mapping, which has added at the end the new node

Connects an ETL Mapping to a left-deep ETL Mapping tree, of the form

    A Left-Deep ETLOperation Tree

final RTable result
/
etlOp3
/
etlOp2     rtab2
/
A leaf-node -->    etlOp1    emptyRTab
/
ETLMapEmpty   rtab1



Example:

  -- connect a Unary ETL mapping (etlOp2)

etlOp2
/
etlOp1    emptyRTab

=> connectETLMapLD etlOp2 emptyRTable prevMap

-- connect a Binary ETL Mapping (etlOp3)

etlOp3
/
etlOp2     rtab2

=> connectETLMapLD etlOp3 rtab2 prevMap


Note that the right branch (RTable) appears first in the list of input arguments of this function and the left branch (ETLMapping) appears second. This is strange, and one could thought that it is a mistake (i.e., the left branch should appear first and the right branch second) since we are reading from left to right. However this was a deliberate choice, so that we leave the left branch (which is the connection point with the previous ETLMapping) as the last argument, and thus we can partially apply the argumenets and get a new function with input parameter only the previous mapping. This is very helpfull in function composition