karps-0.2.0.0: Haskell bindings for Spark Dataframes and Datasets

Safe HaskellNone
LanguageHaskell2010

Spark.Core.Functions

Contents

Synopsis

Creation

dataframe :: DataType -> [Cell] -> DataFrame Source #

Creates a dataframe from a list of cells and a datatype.

Will fail if the content of the cells is not compatible with the data type.
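
A minimal sketch of building a one-column dataframe of integers. The intType and intCell helpers below are hypothetical stand-ins for the actual DataType and Cell constructors (documented in other modules), not confirmed names:

    -- Build a dataframe whose cells are all integers. The call fails if the
    -- cells are not compatible with the declared data type.
    -- 'intType' and 'intCell' are hypothetical helpers, not part of this module.
    exampleDf :: DataFrame
    exampleDf = dataframe intType (map intCell [1, 2, 3])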

Standard conversions

asLocalObservable :: ComputeNode LocLocal a -> LocalFrame Source #

Converts a local node to a local frame. This always works.

asDouble :: (Num a, SQLTypeable a) => LocalData a -> LocalData Double Source #

Casts local data to a double.
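
For instance, the integer observable produced by count (documented below) can be widened to a double before fractional arithmetic; a minimal sketch, assuming the usual Num and SQLTypeable instances for Int:

    -- Widen a row count to Double so it can later be combined with
    -- fractional quantities.
    countAsDouble :: Dataset a -> LocalData Double
    countAsDouble ds = asDouble (count ds)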

Arithmetic operations

(.+) :: forall a1 a2. (Num a1, Num a2, GeneralizedHomo2 a1 a2) => a1 -> a2 -> GeneralizedHomoReturn a1 a2 Source #

A generalization of the addition for the Karps types.

(.-) :: forall a1 a2. (Num a1, Num a2, GeneralizedHomo2 a1 a2) => a1 -> a2 -> GeneralizedHomoReturn a1 a2 Source #

A generalization of subtraction for the Karps types.

(./) :: (Fractional a1, Fractional a2, GeneralizedHomo2 a1 a2) => a1 -> a2 -> GeneralizedHomoReturn a1 a2 Source #

A generalization of division for the Karps types.

div' :: forall a1 a2. (Num a1, Num a2, GeneralizedHomo2 a1 a2) => a1 -> a2 -> GeneralizedHomoReturn a1 a2 Source #
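
A minimal sketch of combining two observables with the generalized addition, assuming that GeneralizedHomoReturn resolves to LocalData Int when both operands are integer observables:

    -- Add the row counts of two datasets; the result is itself an observable.
    totalRows :: Dataset a -> Dataset b -> LocalData Int
    totalRows ds1 ds2 = count ds1 .+ count ds2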

Utilities

(@@) :: CanRename a txt => a -> txt -> a Source #

Renaming operator: assigns a name to a node or a column.

_1 :: FixedProjection1 Source #

_2 :: FixedProjection2 Source #
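
A minimal sketch of using (@@) to name nodes so they are easier to locate in the resulting computation graph; it assumes that string literals are accepted as names (e.g. via OverloadedStrings if the name type is Text):

    -- Attach stable names to the input dataset and to the derived observable.
    namedCount :: Dataset a -> LocalData Int
    namedCount ds = count (ds @@ "input_data") @@ "row_count"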

Standard library

collect :: forall ref a. SQLTypeable a => Column ref a -> LocalData [a] Source #

Collects all the elements of a column into a list.

NOTE: the resulting list is sorted in the canonical ordering of the data type: however the data is stored by Spark, the result always comes back in the same order. This is a departure from Spark, which does not guarantee any ordering on the returned data.

collect' :: DynColumn -> LocalFrame Source #

See the documentation of collect.
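
A minimal sketch of gathering a typed column into an observable list (how the column is extracted from its dataset is documented elsewhere):

    -- Collect every value of an integer column. The result comes back in the
    -- canonical ordering of the data type, regardless of how Spark stores it.
    allValues :: Column ref Int -> LocalData [Int]
    allValues col = collect col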

count :: forall a. Dataset a -> LocalData Int Source #

The number of elements in a dataset.

identity :: ComputeNode loc a -> ComputeNode loc a Source #

The identity function.

Returns a compute node with the same datatype and the same content as the input node. If the operation of the input has a side effect, this side effect is *not* reevaluated.

This operation is typically used when establishing an ordering between some operations such as caching or side effects, along with logicalDependencies.
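
A minimal sketch of using identity to create a distinct pass-through node, here named with (@@) so it is easy to spot in the graph (the name itself is arbitrary):

    -- A pass-through node: same datatype and content as the input, but a
    -- separate node that ordering constraints can later attach to.
    checkpoint :: Dataset a -> Dataset a
    checkpoint ds = identity ds @@ "checkpoint"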

autocache :: Dataset a -> Dataset a Source #

Automatically caches the dataset on an as-needed basis, and deallocates the cached data when the dataset is no longer required.

This function marks a dataset as eligible for the default caching level in Spark. The current implementation performs caching only if it can be established that the dataset is going to be involved in more than one shuffling or aggregation operation.

If the dataset has no observable child, no uncaching operation is added: the autocache operation is equivalent to unconditional caching.
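
A minimal sketch in which the same autocached dataset feeds two aggregations, making it a candidate for actual caching; it reuses the assumption above that adding two integer observables yields LocalData Int:

    -- Mark the dataset as eligible for caching; Karps only caches it if the
    -- shared node really is used by more than one aggregation.
    reused :: Dataset a -> LocalData Int
    reused ds =
      let shared = autocache ds
      in count shared .+ count shared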

cache :: Dataset a -> Dataset a Source #

Caches the dataset.

This function instructs Spark to cache the dataset with its default persistence level (MEMORY_AND_DISK).

Note that the dataset has to be evaluated first for the caching to take effect, so it is usual to call count or another aggregator to force the caching to occur.
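
A minimal sketch in which the cached dataset is immediately forced with count and returned for further use:

    -- Cache the dataset and force its evaluation with an aggregation so the
    -- cached copy is materialized before the dataset is reused downstream.
    cachedWithCount :: Dataset a -> (Dataset a, LocalData Int)
    cachedWithCount ds =
      let c = cache ds
      in (c, count c)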

uncache :: ComputeNode loc a -> ComputeNode loc a Source #

Uncaches the dataset.

This function instructs Spark to unmark the dataset as cached, so that the disk and memory it uses may be reclaimed in the future.

Unlike Spark, Karps is stricter with the uncaching operation:

- the argument of uncache must be a cached dataset;
- once a dataset is uncached, its cached version cannot be used again (i.e. it must be recomputed).

Karps performs escape analysis and will refuse to run programs with caching issues.
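
A minimal sketch of the required shape: the node passed to uncache is a cached dataset, and the cached node is not referenced again afterwards. In a realistic program the uncache would be sequenced after the aggregations that use the cached data, using the ordering tools mentioned under identity:

    -- Cache a dataset and then release it; 'cache ds' is not used anywhere
    -- else, so Karps' escape analysis accepts the program.
    release :: Dataset a -> Dataset a
    release ds = uncache (cache ds)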

joinInner :: Column ref1 key -> Column ref1 value1 -> Column ref2 key -> Column ref2 value2 -> Dataset (key, value1, value2) Source #

Explicit inner join.

joinInner' :: DynColumn -> DynColumn -> DynColumn -> DynColumn -> DataFrame Source #

Untyped version of the inner join.
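
A minimal sketch of a typed inner join written directly against the signature; how the key and value columns are extracted from their datasets is documented elsewhere, and Text is Data.Text:

    import Data.Text (Text)

    -- Inner join on an integer key; the result keeps the key plus the value
    -- columns from both sides.
    joinByKey :: Column ref1 Int -> Column ref1 Text
              -> Column ref2 Int -> Column ref2 Double
              -> Dataset (Int, Text, Double)
    joinByKey = joinInner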

broadcastPair :: Dataset a -> LocalData b -> Dataset (a, b) Source #

Low-level operator that takes an observable and propagates it along the content of an existing dataset.

Users are advised to use the Column-based broadcast function instead.
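
A minimal sketch pairing every row of a dataset with a global aggregate computed from the same dataset:

    -- Attach the total row count to every element; downstream columns can use
    -- it, for example, to compute per-row fractions.
    withTotal :: Dataset a -> Dataset (a, Int)
    withTotal ds = broadcastPair ds (count ds)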