hadoop-streaming: A simple Hadoop streaming library

[ bsd3, cloud, distributed-computing, library, mapreduce ]

A simple Hadoop streaming library based on conduit, useful for writing mapper and reducer logic in Haskell and running it on AWS Elastic MapReduce, Azure HDInsight, GCP Dataproc, and so forth.

Modules

HadoopStreaming
- HadoopStreaming.ByteString
- HadoopStreaming.Text

Downloads

hadoop-streaming-0.2.0.3.tar.gz (Cabal source package)
Package description (as included in the package)

Maintainer's Corner

Package maintainers

zliu41

Versions	0.1.0.0, 0.2.0.0, 0.2.0.1, 0.2.0.2, 0.2.0.3
Change log	CHANGELOG.md
Dependencies	base (>=4.12 && <5), bytestring (>=0.10 && <0.11), conduit (>=1.3.1 && <1.4), extra (>=1.6.18 && <1.8), text (>=1.2.2.0 && <1.3)
Tested with	ghc ==8.8.3, ghc ==8.6.5
License	BSD-3-Clause
Copyright	2020 Ziyang Liu
Author	Ziyang Liu <free@cofree.io>
Maintainer	Ziyang Liu <free@cofree.io>
Uploaded	by zliu41 at 2020-05-18T15:34:43Z
Category	Cloud, Distributed Computing, MapReduce
Home page	https://github.com/zliu41/hadoop-streaming
Bug tracker	https://github.com/zliu41/hadoop-streaming/issues
Source repo	head: git clone https://github.com/zliu41/hadoop-streaming
Downloads	1296 total (8 in the last 30 days)
Rating	(no votes yet) [estimated by Bayesian average]
Status	Docs available; last success reported on 2020-05-18

Readme for hadoop-streaming-0.2.0.3

Hackage: https://hackage.haskell.org/package/hadoop-streaming

Word Count Example

See the Haddock in HadoopStreaming.Text for a simple word-count example.
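
As a rough, library-independent sketch of the logic a word-count job performs, the code below uses only the text package. The Haddock example remains the reference for the actual Mapper and Reducer types; mapLine, reduceKey, and render are hypothetical helper names chosen for illustration and are not part of this package's API.

```haskell
import qualified Data.Text as T

-- Mapper step (hypothetical helper): emit one (word, 1) pair per word in a line.
mapLine :: T.Text -> [(T.Text, Int)]
mapLine line = [(w, 1) | w <- T.words line]

-- Reducer step (hypothetical helper): sum all counts collected for one key.
reduceKey :: T.Text -> [Int] -> (T.Text, Int)
reduceKey key counts = (key, sum counts)

-- Render a pair in the key<TAB>value form that Hadoop streaming uses by default.
render :: (T.Text, Int) -> T.Text
render (k, n) = T.concat [k, T.pack "\t", T.pack (show n)]

main :: IO ()
main = mapM_ (putStrLn . T.unpack . render) (mapLine (T.pack "to be or not to be"))
```

In a real job the mapper and reducer run as separate processes, and Hadoop sorts the mapper output by key before the reducer sees it.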

A Few Things to Note

ByteString vs Text

The HadoopStreaming module provides the general Mapper and Reducer data types, whose input and output types are abstract. They are usually instantiated with either ByteString or Text. ByteString is more suitable if the input/output needs to be decoded/encoded, for instance using the base64-bytestring library. On the other hand, Text could make more sense if decoding/encoding is not needed, or if the data is not UTF-8 encoded (see below regarding encodings). In general I'd imagine ByteString being used much more often than Text.

The HadoopStreaming.ByteString and HadoopStreaming.Text modules provide some utilities for working with ByteString and Text, respectively.
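
To make the choice more concrete, here is a minimal sketch of the same line-splitting logic written against each type. It assumes the default Hadoop streaming convention of a tab between key and value. splitKeyValueBS and splitKeyValueText are hypothetical helpers, not functions exported by this package.

```haskell
import qualified Data.ByteString as B
import qualified Data.ByteString.Char8 as BC
import qualified Data.Text as T

-- ByteString version (hypothetical helper): split at the first tab byte (0x09).
-- The key and value stay as raw bytes, ready for further decoding (for example base64).
splitKeyValueBS :: B.ByteString -> (B.ByteString, B.ByteString)
splitKeyValueBS line =
  let (k, rest) = B.break (== 0x09) line
  in (k, B.drop 1 rest)

-- Text version (hypothetical helper): split at the first tab character.
-- The input has already been decoded, so the split works on characters, not bytes.
splitKeyValueText :: T.Text -> (T.Text, T.Text)
splitKeyValueText line =
  let (k, rest) = T.break (== '\t') line
  in (k, T.drop 1 rest)

main :: IO ()
main = do
  print (splitKeyValueBS (BC.pack "word\t1"))
  print (splitKeyValueText (T.pack "word\t1"))
```

A line without a tab yields the whole line as the key and an empty value in both versions.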

Encoding

It is highly recommended that your input data be UTF-8 encoded, as this is the default encoding Hadoop uses. If you must use other encodings such as UTF-16, keep in mind the following gotchas:

It is not enough that your code can work with the encoding you choose to use:
- By default, if any of your input files does not end with a UTF-8 representation of newline, i.e., a 0x0A byte, Hadoop streaming will add a 0x0A byte.
- Likewise, if any line in your mapper output does not contain a UTF-8 representation of tab (0x09), Hadoop streaming will add it at the end of the line.
This will almost certainly break your job. It may be possible to configure Hadoop streaming to use another encoding, so that the above behavior is consistent with the encoding you choose, but I don't know whether that is the case. I tried -D mapreduce.map.java.opts="-Dfile.encoding=UTF-16BE", but that didn't seem to work.
If you use ByteString as the input type and use Data.ByteString.hGetLine to read lines from the input, be aware that Data.ByteString.hGetLine uses 0x0A bytes as line breaks, so it doesn't work properly for non-UTF-8 encoded input. For example, in UTF-16BE and UTF-16LE, the newline character is encoded as 0x00 0x0A and 0x0A 0x00, respectively.
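
Both gotchas can be reproduced on a few bytes without running Hadoop. The sketch below uses only the bytestring and text packages. appendNewlineByte is a hypothetical model of the default behavior described above, not Hadoop's actual implementation, and naiveLines stands in for any reader that treats 0x0A bytes as line breaks.

```haskell
import qualified Data.ByteString as B
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE

-- "hi\nyo" encoded as UTF-16LE: 68 00 69 00 0A 00 79 00 6F 00
sampleLE :: B.ByteString
sampleLE = TE.encodeUtf16LE (T.pack "hi\nyo")

-- Hypothetical model of the described default: if the input does not end
-- with a 0x0A byte, a single 0x0A byte is appended.
appendNewlineByte :: B.ByteString -> B.ByteString
appendNewlineByte bs
  | not (B.null bs) && B.last bs == 0x0A = bs
  | otherwise = B.snoc bs 0x0A

-- Split on raw 0x0A bytes, as a byte-oriented line reader would.
naiveLines :: B.ByteString -> [B.ByteString]
naiveLines = B.split 0x0A

-- Decode the whole input first, then split on the newline character.
decodedLines :: B.ByteString -> [T.Text]
decodedLines = T.lines . TE.decodeUtf16LE

main :: IO ()
main = do
  print (B.length (appendNewlineByte sampleLE))
  print (map B.unpack (naiveLines sampleLE))
  print (decodedLines sampleLE)
```

The appended byte leaves an odd number of bytes, which cannot be valid UTF-16. The byte-level split leaves a stray 0x00 at the start of the second chunk, so neither chunk boundary lines up with the encoded text. Decoding before splitting recovers the two original lines.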