![dbfunctor logo](./dbfunctor.png) # DBFunctor: Functional Data Management ## ETL/ELT* Data Processing in Haskell **[DBFunctor](https://hackage.haskell.org/package/DBFunctor)** is a [Haskell](https://haskell-lang.org/) library for *ETL/ELT[^1]* data processing of tabular data. What does this mean? It simply means that whenever you have a ***data analysis*, *data preparation*, or *data transformation* task** and you want to do it with Haskell type-safe code, that you enjoy, love and trust so much, now you can! ### Main Features 1. **[Julius](https://hackage.haskell.org/package/DBFunctor-0.1.0.0/docs/Etl-Julius.html) DSL: A Type-Level Embedded Domain Specific Language (EDSL) for ETL** Provides an intuitive type-level Embedded Domain Specific (EDSL) Language called *Julius* for expressing complex data flows (i.e., ETL flows) but also for performing SQL-like data analysis. For more info check this [Julius tutorial](https://github.com/nkarag/haskell-DBFunctor/blob/master/doc/JULIUS-TUTORIAL.md). 2. **Supports all known relational algrebra operations** Julius supports all known relational algebra operations (selection, projection, inner/outer join, grouping, ordering, aggregation, set operations etc.) 3. **Provides the ETL Mapping and other typical ETL constructs and operations** Julius implements typical ETL constructs such the *Column Mapping* and the *ETL Mapping*. 4. **Applicable to all kinds of tabular data** It is applicable to all kinds of "tabular data" (see explanation below) 5. **In-memory, database-less data processing** Data transformations or queries can run *in-memory*, within your Haskell code, without the need for a database to process your data. 6. **Offloading to a database for heavy queries/data transformations** In addition, a query or data transformation can be *offloaded to a Database*, when data don't fit in memory, or heavy data processing over large volumes of data is required. The result can be fetched into the client's memory (i.e., where your haskell code runs) in the `RTable` data structure (see below), or stored in a database staging table. 7. **Workflow Operations** Julius provides common workflow operations. Workflows provide the ability to combine the evaluation of several different Julius Expressions (i.e., data pipelines) in an arbitrary logic. Examples of such operations include: - Ability to handle a failure of some operation in a Julius expression: - retry the failed operation (after corrective actions have taken place) and continue the evaluation of the Julius expression from this point onward. - skip the failed operation and move on with the rest operations in the pipeline. - restart the Julius expression from the beginning - terminate the Julius expression and skip all pending operations - Ability to start a Julius expression based on the success or failure result of another one - Ability to fork several different Julius expressions that will run concurrently - Conditional execution of Julius expressions and iteration functionality - Workflow hierarchy (i.e., flows, subflows etc.) 8. **"Declarative ETL"** Enables *declarative ETL* implementation in the same sense that SQL is declarative for querying data (see more below). ### Typical examples of DBFunctor use-cases - **Build database-less Haskell apps.** Build your data processing haskell apps without the need to import your data in a database for querying functionality or any for executing any data transformations. Analyze your CSV files in-place with plain haskell code (*for Haskellers!*). - **Data Preparation.** I.e., clean-up data, calculate derived fields and variables, group by and aggregate etc., in order to feed some machine learning algorithm (*for Data Scientists*). - **Data Transformation.** in order to transform data from Data Model A to Data Model B (typical use-case *for Data Engineers* who perform ETL/ELT[^1] tasks for feeding Data Warehouses or Data Marts) - **Data Exploration.** Ad hoc data analysis tasks, in order to explore a data set for several purposes such as to find business insights and solve a specific business problem, or maybe to do data profiling in order to evaluate the quality of the data coming from a data source, etc (*for Data Analysts*). - **Business Intelligence.** Build reports, or dashboards in order to share business insights with others and drive decision making process (*for BI power-users*) [^1]: **ETL** stands for **Extract Transform and Load** and is the standard technology for accomplishing data management tasks in Data Warehouses / Data Marts and in general for preparing data for any analytic purposes (Ad hoc queries, data exploration/data analysis, Reporting and Business Intelligence, feeding Machine Learning algorithms, etc.). **ELT** is a newer variation of ETL and means that the data are first Loaded into their final destination and then the data transformation runs in-place (as opposed to running at a separate staging area on possibly a different server)). ### When to Use it? DBFunctor should be used whenever a data analysis, or data manipulation, or data transformation task, over *tabular data*, must be performed and we wish to perform it with Haskell code -yielding all the well-known (to Haskellers) benefits from doing that- without the need to use a database query engine for this task. DBFunctor provides an in-memory data structure called `RTable`, which implements the concept of a *Relational Table* (which -simply put- is a set of tuples) and all relevant relational algebra operations (Selection, Projection, Inner Join, Outer Joins, aggregations, Group By, Set Operations etc.). Moreover, it implements the concept of *Column Mapping* (for deriving new columns based on existing ones - by splitting , merging , or with any other possible combination using a lambda expression or a function to define the new value) and that of the *ETL Mapping*, which is the equivalent of a "mapping" in an ETL tool (like Informatica, Talend, Oracle Data Integrator, SSIS, Pentaho, etc.). With this powerful construct, one can **build arbitrary complex data pipelines**, which can enable any type of data transformations and all these **by writing Haskell code.** ### What Kinds of Data? With the term "tabular data" we mean any type of data that can be mapped to an RTable (e.g., CSV (or any other delimiter), DB Table/Query, JSON etc). Essentially, for a Haskell data type `a`to be "tabular", one must implement the following functions: ```haskell toRTable :: RTableMData -> a -> RTable fromRTable :: RTableMData -> RTable -> a ``` These two functions implement the "logic" of transforming data type `a` to/from an RTable based on specific RTable Metadata, which specify the column names and data types of the RTable, as well as (optionally) the primary key constraint, and/or alternative unique constraints (i.e., similar information provided with a CREATE TABLE statement in SQL) . By implementing these two functions, data type `a` essentially becomes an instance of the type `class RTabular` and thus can be transformed with the DBFunctor package. Currently, we have implemented a CSV data type (any delimeter allowed), based one the [Cassava](https://github.com/haskell-hvr/cassava) library, in order to enable data transformations over CSV files. ### Current Status and Roadmap Currently (version DBFunctor-0.1.0.0), the DBFunctor package **is stable for in-memory data transformation and queries of CSV files** (any delimiter allowed), with the **Julius EDSL** (module Etl.Julius) , or directly via RTable functions (module RTable.Core). The use of the Julius language is strongly recommended because it simplifies greatly and standardizes the creation of complex ETL flows. All in all, currently main features from #1 to #5 (from the list above) have been implemented and main features > #5 are future work that will be released in later versions. ### Future Vision -> Declarative ETL Our ultimate goal is, eventually to make DBFunctor the **first *Declarative library for ETL/ELT, or data processing in general***, by exploiting the virtues of functional programming and Haskell strong type system in particular. Here we use "declarative" in the same sense that SQL is a declarative language for querying data. (You only have to state what data you want to be returned and you don't care about how this will be accomplished - the DBMS engine does this for you behind the scenes). In the same manner, ideally, one should only need to code the desired data transformation from a *source schema* to a *target schema*, as well as all the *data integrity constraints* and *business rules* that should hold after the transformation and not having to define all the individual steps for implementing the transformation, as it is the common practice today. This will yield tremendous benefits compared to common ETL challenges faced today and change the way we build data transformation flows. Just to name a few: - Automated ETL coding driven by Source-to-Target mapping and business rules - ETL code correctness out-of-the-box - Data Integrity / Business Rules controls automatically embedded in your ETL code - Self-documented ETL code (Your documentation i.e., the Source-to-Target mapping and the business rules, is also the only code you need to write!) - Drastically minimize time-to-market for delivering Data Marts and Data Warehouses, or simply implementing Data Analysis tasks. The above is inspired by the theoretical work on Categorical Databases by David Spivak, ### Available Modules *DBFunctor* consists of the following set of Haskell modules: * **RTable.Core**: Implements the relational Table concept. Defines all necessary data types like `RTable` and `RTuple` as well as basic relational algebra operations on RTables. * **Etl.Julius**: A simple Embedded DSL for ETL/ELT data processing in Haskell * **RTable.Data.CSV**: Implements `RTable` over CSV (TSV, or any other delimiter) files logic. It is based on the [Cassava](https://github.com/haskell-hvr/cassava) library. ### A Very Simple Example In this example, we will load a CSV file, turn it into an RTable and then issue a very simple query on it and print the result, just to show the whole concept. So lets say we have a CSV file called test-data.csv. The file stores table metadata from an Oracle database. Each row represents a table stored in the database. Here is a small extract from the csv file: $ head test-data.csv OWNER,TABLE_NAME,TABLESPACE_NAME,STATUS,NUM_ROWS,BLOCKS,LAST_ANALYZED APEX_030200,SYS_IOT_OVER_71833,SYSAUX,VALID,0,0,06/08/2012 16:22:36 APEX_030200,WWV_COLUMN_EXCEPTIONS,SYSAUX,VALID,3,3,06/08/2012 16:22:33 APEX_030200,WWV_FLOWS,SYSAUX,VALID,10,3,06/08/2012 22:01:21 APEX_030200,WWV_FLOWS_RESERVED,SYSAUX,VALID,0,0,06/08/2012 16:22:33 APEX_030200,WWV_FLOW_ACTIVITY_LOG1$,SYSAUX,VALID,1,29,07/20/2012 19:07:57 APEX_030200,WWV_FLOW_ACTIVITY_LOG2$,SYSAUX,VALID,14,29,07/20/2012 19:07:57 APEX_030200,WWV_FLOW_ACTIVITY_LOG_NUMBER$,SYSAUX,VALID,1,3,07/20/2012 19:08:00 APEX_030200,WWV_FLOW_ALTERNATE_CONFIG,SYSAUX,VALID,0,0,06/08/2012 16:22:33 APEX_030200,WWV_FLOW_ALT_CONFIG_DETAIL,SYSAUX,VALID,0,0,06/08/2012 16:22:33 **1. Turn the CSV file into an RTable** The first thing we want to do is to read the file and turn it into an RTable. In order to do this we need to define the RTable Metadata, which is the same information one can provide in an SQL CREATE TABLE statement, i,e, column names, column data types and integrity constraints (Primary Key, Unique Key only - no Foreign Keys). So lets see how this is done: ```haskell -- Define table metadata src_DBTab_MData :: RTableMData src_DBTab_MData = createRTableMData ( "sourceTab" -- table name ,[ ("OWNER", Varchar) -- Owner of the table ,("TABLE_NAME", Varchar) -- Name of the table ,("TABLESPACE_NAME", Varchar) -- Tablespace name ,("STATUS",Varchar) -- Status of the table object (VALID/IVALID) ,("NUM_ROWS", Integer) -- Number of rows in the table ,("BLOCKS", Integer) -- Number of Blocks allocated for this table ,("LAST_ANALYZED", Timestamp "MM/DD/YYYY HH24:MI:SS") -- Timestamp of the last time the table was analyzed (i.e., gathered statistics) ] ) ["OWNER", "TABLE_NAME"] -- primary key [] -- (alternative) unique keys main :: IO () main = do -- read source csv file srcCSV <- readCSV "./app/test-data.csv" let -- turn source csv to an RTable src_DBTab = toRTable src_DBTab_MData srcCSV ... ``` We have used the following functions: ```haskell -- | createRTableMData : creates RTableMData from input given in the form of a list -- We assume that the column order of the input list defines the fixed column order of the RTuple. createRTableMData :: (RTableName, [(ColumnName, ColumnDType)]) -> [ColumnName] -- ^ Primary Key. [] if no PK exists -> [[ColumnName]] -- ^ list of unique keys. [] if no unique keys exists -> RTableMData ``` in order to define the RTable metadata. For reading the CSV file we have used: ```haskell -- | readCSV: reads a CSV file and returns a CSV data type (Treating CSV data as opaque byte strings) readCSV :: FilePath -- ^ the CSV file -> IO CSV -- ^ the output CSV type ``` Finally, in order to turn the CSV data type into an RTable, we have used function: ```haskell toRTable :: RTableMData -> CSV -> RTable ``` which comes from the `RTabular` type class instance of the `CSV` data type. **2. Query the RTable** Once we have created an RTable, we can issue queries on it, or apply any type of data transformations. Note that due to immutability, each query or data transformation creates a new RTable. We will now issue the following query: We return all the rows, which correspond to some filter predicate - in particular all rows where the `TABLE_NAME` includes some search string and the `LAST_ANALYZED` field is greater than an input date. For this we use the Julius EDSL, in order to express the query and then with the function `runJulius`, we evaluate the expression into an RTable. ```haskell runJulius :: ETLMappingExpr -> IO RTable ``` Here is the Julius expression that yield the desired results. ```haskell julExpr srch dtstr rtab = EtlMapStart :-> (EtlR $ ROpStart :. (Filter (From $ Tab rtab) $ FilterBy (\t -> case instrRText (RText srch) (t "TABLE_NAME") of Just p -> True Nothing -> False && (t "LAST_ANALYZED") >= (RTime $ toRTimestamp "DD/MM/YYYY" dtstr) ) ) ) ``` A Julius expression is a *data processing chain* consisting of various Relational Algebra operations `(EtlR $ ...)` and/or column mappings `(EtlC $ ...)` connected together via the `:->` data constructor, of the form (Julius expressions are read *from top-to-bottom or from left-to-right*): ```haskell myJulExpression = EtlMapStart :-> (EtlC $ ...) -- this is a Column Mapping :-> (EtlR $ -- this is a series of Relational Algebra Operations ROpStart :. (Rel Operation 1) -- a relational algebra operation :. (Rel Operation 2)) :-> (EtlC $ ...) -- This is another Column Mapping :-> (EtlR $ -- more relational algebra operations ROpStart :. (Rel Operation 3) :. (Rel Operation 4) :. (Rel Operation 5)) :-> (EtlC $ ...) -- This is Column Mapping 3 :-> (EtlC $ ...) -- This is Column Mapping 4 ... ``` In our example, the Julius expression consists only of a single relational algebra operation, namely a `Filter` operation, which uses an RTuple predicate of the form `RTuple -> Bool` to filter out RTuples (i.e., rows) that dont satisfy this predicate. The predicate is expressed as the lambda expression: ```haskell FilterBy (\t -> case instrRText (RText srch) (t "TABLE_NAME") of Just p -> True Nothing -> False && (t "LAST_ANALYZED") >= (RTime $ toRTimestamp "DD/MM/YYYY" dtstr) ``` We use the [instrRText](https://hackage.haskell.org/package/DBFunctor-0.1.0.0/docs/RTable-Core.html#g:16) function to find these table_name values that include the `srch` string. Also, we use the [toRTimestamp](https://hackage.haskell.org/package/DBFunctor-0.1.0.0/docs/RTable-Core.html#g:15) function, in order to turn the date string `dtstr` into an `RTimestamp` data type and compare it against the `LAST_ANALYZED` column, Finally, in order to print the result of the query on the screen, we use the ```haskell printfRTable :: RTupleFormat -> RTable -> IO() ``` function, which brings printf-like functionality into the printing of RTables And here is the output: ``` $ stack exec -- exampleDBFunctor Print all tables that incude a "search string" in their name and have been analyzed after a specific date Give me the search string: FLOW Give me the date in "DD/MM/YYYY" format: 01/04/2018 --------------------------------------------------------------------------------------------------------------------------------- OWNER TABLE_NAME TABLESPACE_NAME STATUS NUM_ROWS BLOCKS LAST_ANALYZED ~~~~~ ~~~~~~~~~~ ~~~~~~~~~~~~~~~ ~~~~~~ ~~~~~~~~ ~~~~~~ ~~~~~~~~~~~~~ APEX_040100 WWV_FLOW_ACTIVITY_LOG1$ SYSAUX VALID 4052 155 04/04/2018 18:19:56 APEX_040100 WWV_FLOW_ACTIVITY_LOG2$ SYSAUX VALID 1771 92 16/04/2018 17:33:16 APEX_040100 WWV_FLOW_ACTIVITY_LOG_NUMBER$ SYSAUX VALID 1 3 10/04/2018 16:09:25 APEX_040100 WWV_FLOW_COMPANIES SYSAUX VALID 10 3 16/04/2018 17:33:13 APEX_040100 WWV_FLOW_DATA SYSAUX VALID 109 155 16/04/2018 16:06:38 APEX_040100 WWV_FLOW_DEBUG_MESSAGES2 SYSAUX VALID 0 0 10/04/2018 16:09:25 APEX_040100 WWV_FLOW_FND_USER SYSAUX VALID 50 3 05/04/2018 18:09:23 APEX_040100 WWV_FLOW_PAGE_CACHE SYSAUX VALID 22 3 05/04/2018 18:35:18 APEX_040100 WWV_FLOW_SESSIONS$ SYSAUX VALID 182 26 16/04/2018 16:07:13 APEX_040100 WWV_FLOW_USER_ACCESS_LOG1$ SYSAUX VALID 127 5 11/04/2018 18:27:17 APEX_040100 WWV_FLOW_USER_ACCESS_LOG2$ SYSAUX VALID 39 5 16/04/2018 17:30:59 APEX_040100 WWV_FLOW_USER_ACCESS_LOG_NUM$ SYSAUX VALID 1 3 12/04/2018 16:30:05 APEX_040100 WWV_FLOW_WORKSHEET_CONDITIONS SYSAUX VALID 501 18 16/04/2018 17:33:03 APEX_040100 WWV_FLOW_WORKSHEET_GROUP_BY SYSAUX VALID 22 3 03/04/2018 17:44:07 APEX_040100 WWV_FLOW_WORKSHEET_RPTS SYSAUX VALID 505 16 16/04/2018 17:33:01 FLOWS_FILES WWV_FLOW_FILE_OBJECTS$ SYSAUX VALID 302 16 13/04/2018 18:00:05 16 rows returned --------------------------------------------------------------------------------------------------------------------------------- ``` **Here is the complete example**. ```haskell {-# LANGUAGE OverloadedStrings #-} module Main where import Etl.Julius import RTable.Data.CSV (CSV, readCSV, toRTable, writeCSV) import Data.Text.IO as T (getLine) -- This is the input source table metadata -- It includes the tables stored in an imaginary database src_DBTab_MData :: RTableMData src_DBTab_MData = createRTableMData ( "sourceTab" -- table name ,[ ("OWNER", Varchar) -- Owner of the table ,("TABLE_NAME", Varchar) -- Name of the table ,("TABLESPACE_NAME", Varchar) -- Tablespace name ,("STATUS",Varchar) -- Status of the table object (VALID/IVALID) ,("NUM_ROWS", Integer) -- Number of rows in the table ,("BLOCKS", Integer) -- Number of Blocks allocated for this table ,("LAST_ANALYZED", Timestamp "MM/DD/YYYY HH24:MI:SS") -- Timestamp of the last time the table was analyzed (i.e., gathered statistics) ] ) ["OWNER", "TABLE_NAME"] -- primary key [] -- (alternative) unique keys -- Result RTable metadata result_tab_MData :: RTableMData result_tab_MData = createRTableMData ( "resultTab" -- table name ,[ ("OWNER", Varchar) -- Owner of the table ,("TABLE_NAME", Varchar) -- Name of the table ,("LAST_ANALYZED", Timestamp "MM/DD/YYYY HH24:MI:SS") -- Timestamp of the last time the table was analyzed (i.e., gathered statistics) ] ) ["OWNER", "TABLE_NAME"] -- primary key [] -- (alternative) unique keys main :: IO () main = do -- read source csv file srcCSV <- readCSV "./app/test-data.csv" putStrLn "\nPrint all tables that incude a \"search string\" in their name and have been analyzed after a specific date\n" putStrLn "Give me the search string: " search <- T.getLine putStrLn "Give me the date in \"DD/MM/YYYY\" format: " datestr <- Prelude.getLine -- print source RTable first n rows -- RTable A resultRTab <- runJulius $ julExpr search datestr $ toRTable src_DBTab_MData srcCSV printfRTable ( -- this is the equivalent when printing on the screen a list of columns, defined in a SELECT clause in SQL genRTupleFormat ["OWNER", "TABLE_NAME", "TABLESPACE_NAME", "STATUS", "NUM_ROWS", "BLOCKS", "LAST_ANALYZED"] genDefaultColFormatMap ) $ resultRTab -- save result to a CSV file writeCSV "./app/result-data.csv" $ fromRTable result_tab_MData resultRTab where julExpr srch dtstr rtab = EtlMapStart :-> (EtlR $ ROpStart :. (Filter (From $ Tab rtab) $ FilterBy (\t -> case instrRText (RText srch) (t "TABLE_NAME") of Just p -> True Nothing -> False && (t "LAST_ANALYZED") >= (RTime $ toRTimestamp "DD/MM/YYYY" dtstr) ) ) ) ``` ### Julius DSL Tutorial We have written [a Julius tutorial](https://github.com/nkarag/haskell-DBFunctor/blob/master/doc/JULIUS-TUTORIAL.md) to help you get started with Julius DSL. ### How to run #### 1. Install Stack See [this](https://docs.haskellstack.org/en/stable/GUIDE/) guide for help. If you have stack already installed, then we suggest you run a `stack upgrade`, in order to update it to the latest version and avoid any error messages due to bugs. Then run a `stack update`, in order to update the package index. #### 2. Initiate a project `$ stack new myDBFunctorProject` #### 3. Start coding with DBFunctor Don't forget: - to `import Etl.Julius` module - to use GHC extension `{-# LANGUAGE OverloadedStrings #-}`, since DBFunctor uses `Text` in all of its basic data types, this extension is necessary if you want to assign string literal values to an `RDataType` (the type of a column in an `RTable`) #### 4. Declare dependency with DBFunctor package Edit your package.yaml (or project.cabal) file and add the dependency to the DBFunctor package ``` dependencies: - DBFunctor ``` Also, don't forget to add to your stack.yaml file the line: ``` extra-deps: - DBFunctor-0.1.0.0 ``` #### 5. Build and run you project ``` $ stack build ``` Run ``` stack exec -- myDBFunctorProject-exe ```